PROTEIN  SEQUENCE  DIVERGENCE  RELATIVE  TO  PROTEIN  FOLDING 


By 

DANNY  W.  DE  KEE 


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 
OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 
DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 


2004 


Copyright  2004 
by 


Danny  W.  De  Kee 


ACKNOWLEDGMENTS 


Throughout  the  journey  of  life,  I have  had  the  honor  of  being  surrounded  by 
exceptional  people.  Without  their  guidance  and  friendship,  the  completion  of  this  stage 
in  my  journey  would  not  have  been  possible.  First  I must  thank  my  parents  for  allowing 
me  to  pursue  my  goals  and  for  remaining  supportive  of  the  decisions  I made  to  reach 
those  goals.  There  is  no  adequate  way  to  completely  acknowledge  their  role  in  forming 
the  man  I am  today  but  to  say  that  I am  proud  to  be  their  son. 

I would  like  to  express  my  gratitude  to  my  advisor,  Dr.  Steven  Benner,  for  his 
guidance  and  support  during  the  past  few  years.  I would  also  like  to  thank  Raphael 
LaFrance  for  his  tremendous  computational  assistance. 




TABLE  OF  CONTENTS 


page 

ACKNOWLEDGMENTS iii 

LIST  OF  TABLES vi 

LIST  OF  FIGURES viii 

ABSTRACT x 

1 INTRODUCTION 1 

Classical  Structure  Prediction 4 

Theoretical  Methods 5 

Empirical  Methods 7 

Statistical 7 

Sequence  similarity  methods 9 

Evolution-Based  Structure  Prediction 10 

Understanding  the  Details  of  Molecular  Evolution 11 

Alignment 11

Substitution  Matrices 12 

Gaps  in  an  Alignment 14 

Parsing  Strings 15 

Neutral  versus  Adaptive  Variation 15 

Models of Protein Sequence Divergence 16

2 CREATION  OF  THE  DATABASE 18 

Introduction 18 

Construction  of  the  Database 18 

3 PREDICTING  SEGMENTS  OF  SECONDARY  STRUCTURE 25 

Introduction 25 

Weaknesses  of  Empirical  Statistical  Methods 25 

Experimental  Host-Guest  Approach 27 

Methods  and  Materials 29 

Results 31 




Discussion 33 

Prediction  of  Internal  Helices 36 


4 IDENTIFYING PARSES THAT SEPARATE SECONDARY STRUCTURAL
UNITS: IMPLICATIONS FOR PREDICTING PROTEIN SECONDARY
STRUCTURE 41

Introduction 41 

Methods  and  Materials 43 

Results 45 

Single  Residue  Parses 45 

Dipeptide  Parses 46 

Gap  Parses 49 

Conclusion 52 

5  EVALUATION  OF  CASP  V 54 

Introduction 54 

Evaluation  of  a Secondary  Structure  Prediction 55 

CASP  V 61 

General  Trends 65 

Examination  of  Predictions 67 

T0129 68

Results for T0129 74

T0148 81

Results for T0148 87

T0149 91

Results for T0149 120

Conclusion 125 

Future  Work 126 


6  CONCLUSION 131 

APPENDIX  A SUPPLEMENTARY  MATERIAL  FOR  THE  CREATION  OF  THE 
DATABASE 132 

APPENDIX  B MOTIF  DESCRIPTION 146 

REFERENCES 149 

BIOGRAPHICAL  SKETCH 157 




LIST  OF  TABLES 


Table  page 

2-1. Frequencies of the standard amino acids in the protein sequence database 24

3-1. Tabulation of the number of helices, the number of internal and surface residues and
the percentage of helices that follow a helical wheel 32

3-2. Propensity of the amino acids within helices where all side chains are internal
(1400) 33

3-3. Propensity of the amino acids within helices where not all side chains are internal
(4497) 34

3-4. Chou-Fasman (1974) assignment of amino acid propensities 35

3-5. Change in propensity from internal to non-internal helices 35

3-6. Number of Core and Non-core internal helices as well as their respective accessible
surface areas and lengths 40

4-1. Single residue parsing probabilities 45

4-2. Frequencies of dipeptides in parses, helices and strands 47

4-3. Reliability of gaps as an indicator of parsing elements 50

4-4. Coverage, by residue, of gaps as an indicator of parsing elements 51

4-5. Coverage, by residue, of gaps as an indicator of parsing elements 51

5-1. Summary of Prediction Targets for the CASP V ab initio Project 63

5-2.  Summary  of  Results  for  the  CASP  V ab  initio  Project 64 

5-3.  Summary  of  Baker’s  results  for  the  CASP  V ab  initio  Project 64 

5-4.  Dihedral  angles  for  segment  F in  T0129 80 

5-5. PHD results for T0129 81

5-6.  Segment  1 106 


5-7. Segment 2 107

5-8.  Segment  4 108 

5-9.  Segment  5 109 

5-10.  Segment  6 110 

5-11. Segment 7 111

5-12.  Segment  9 112 

5-13.  Segment  10 112 

5-14.  Segment  11 113 

5-15.  Segment  13 114 

5-16.  Segment  14 115 

5-17.  Segment  16 116 

5-18.  Segment  18 118 

5-19.  Segment  20 119 

A-1. List of gi identifiers used in this study 134

B-1. Individual helices for T0129, PDB id: 1izm 147

B-2. Helix interactions involving 1izm 147

B-3. Individual helices for T0148, PDB id: 1in0 147

B-4. Helix interactions involving 1in0 147

B-5. Individual strands for T0148, PDB id: 1in0 148

B-6. Individual helices for T0149, PDB id: 1inj 148

B-7. Helix interactions involving 1inj 148

B-8. Individual strands for T0149, PDB id: 1inj 148



LIST  OF  FIGURES 


Figure  page 

1-1. Two rotational degrees of freedom in an amino acid, designated by the dihedral
angles φ and ψ, give a peptide chain its flexibility 2

1-2. Schiffer-Edmundson helical wheel showing the position of hydrophobic and
hydrophilic amino acids 6

1-3. The log odds matrix for PAM250 (multiplied by 10) 13

2-1. Example of a family from the protein sequence database 23

3-1. Plot of number of helices versus helix length in the reference protein database 31

3-2. Accuracy of predicting residues in helices (or entire helical segments) is plotted vs.
the observed relative solvent accessibility (according to DSSP assignments) 38

4-1. A strand of protein in (a) the linear, fully extended conformation, and (b) the same
strand with a proline in the middle 53

5-1. Ramachandran plot showing the (arbitrary) boundaries between values of φ and ψ
that indicate α helices, β strands, and coils (the remainder of the diagram) 57

5-2. Plot of the Q3 scores versus the number of proteins in the family 65

5-3.  The  alignment  for  Target  T0129 70 

5-4. Phylogenetic trees for Target T0129, MC family 18669 71

5-5. Schiffer-Edmundson helical wheels show 3.6-residue periodicity in surface (s) and
interior (i) assignments for five segments of Target T0129 72

5-6. Ribbon representation of T0129 74

5-7. Alignment for Target T0129 76

5-8. Phylogenetic trees for Target T0148, MC family 7610 82

5-9. The alignment for Target T0148 83

5-10. The helical wheel segments of T0148 85

5-11. Ribbon representation of the YajQ monomer 87

5-12. The alignment for Target T0148 88

5-13. Phylogenetic trees for the 5 subfamilies of T0149 91

5-14. Multiple sequence alignments for Subfamily T0149 94

5-15. Segment 3 108

5-16. Helical wheel showing the lack of amphipathic behavior for Subfamily 3 109

5-17. Segment 8 111

5-18. Helical wheel showing the lack of amphipathic behavior for Subfamily 3 112

5-19.  Segment  12 113 

5-20.  Helical  wheel  showing  the  potential  for  an  internal  helix 114 

5-21.  Segment  15 116 

5-22.  Helical  wheel  for  positions  182-190  from  Subfamily  1 116 

5-23.  Segment  17 117 

5-24.  Segment  19 119 

5-25.  Segment  21 120 

5-26.  Ribbon  representation  of  the  T0149 121 

5-27. The alignment for Target T0149 121

A-1. GetSeqIds.pl 133

A-2.  PrepareSeqIds.pl 138 

A-3.  darwin.ma 140 

A-4.  GetDSSP.pl 143 

A-5.  Master.pl 145 




Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 

PROTEIN  SEQUENCE  DIVERGENCE  RELATIVE  TO  PROTEIN  FOLDING 

By 

Danny  W.  De  Kee 
May  2004 

Chair:  Steven  A.  Benner 
Major  Department:  Chemistry 

This  dissertation  reports  studies  on  protein  sequence  divergence  relative  to  protein 
folding.  A protein  database  was  created  containing  multiple  sequence  alignments  of 
homologous  families  and  their  corresponding  secondary  structure  information.  This 
database  was  queried  to  learn  how  the  sequences  of  natural  proteins  might  be  analyzed  to 
predict  parses,  which  are  breaks  in  standard  secondary  structural  units  (helices  and 
strands)  in  the  folded  form.  Specific  strings  of  consecutive  amino  acids  in  a polypeptide 
chain  were  identified  that  improved  assignments  of  parses  based  on  single-residue 
analysis. 

Aspects  of  protein  conformation  were  analyzed  by  assigning  interior  and  surface 
residues  from  patterns  of  variation  and  conservation  in  homologous  protein  sequences. 
This  study  improved  the  ability  to  predict  secondary  structural  units. 

These  heuristics  are  described  in  detail  and  their  performance  evaluated  when 
applied  to  ten  protein  families  with  known  three-dimensional  structures.  The  use  of  these 
heuristics  is  discussed  in  the  context  of  protein  secondary  structure  prediction. 




CHAPTER  1 
INTRODUCTION 

One  of  the  most  significant  problems  in  biological  research  today  is  the  prediction 
of  protein  structure  from  knowledge  of  the  primary  amino  acid  sequence.  Large-scale 
genome-sequencing  efforts  have  made  this  problem  even  more  significant,  since  the 
growth  of  sequences  has  easily  outpaced  the  elucidation  of  protein  structures.  Therefore, 
computational  approaches  are  needed  to  determine  the  folded  structure  of  protein 
sequences  for  which  experimental  data  are  not  yet  available.  A number  of  real-world 
applications  benefit  from  knowledge  of  the  protein  structure,  including  the  discovery  of 
mechanisms  and  structure-based  drug  design.  It  is  conceivable  that  predicted  structures 
could  also  confer  such  benefits. 

The  holy  grail  of  protein  structure  prediction  would  be  an  algorithm  that  accepted  a 
sequence  of  amino  acids  and  returned  the  secondary  and  tertiary  structure  encoded  within 
those  residues.  However,  such  ab  initio  prediction  is  not  currently  possible  due  to  lack  of 
knowledge  of  underlying  protein  folding  mechanisms  as  well  as  computational  power. 
Protein structure prediction, also referred to as the protein folding problem, is difficult for
many reasons, all of which are important as we consider how it might be solved. These
difficulties are discussed in the following text.

First,  proteins  are  big,  especially  when  compared  with  the  molecules  that  have  long 
been  the  focus  of  conformational  analysis  in  organic  chemistry.  Proteins  typically 
contain  100-1000  amino  acids,  or  1000-20000  atoms.  Every  peptide  unit  in  the 
polypeptide chain has two rotational degrees of freedom (Figure 1-1), assuming that the
amide bond itself is planar and lies exclusively in the trans conformation. One degree of
rotational freedom is around the bond joining the carbonyl carbon and the α carbon of the
amino acid. The second is around the bond joining the α carbon and the nitrogen. These
are known as the ψ and φ angles, respectively (Ramachandran and Sasisekharan 1968).
Flexibility  in  the  side  chains  adds  additional  rotational  degrees  of  freedom  to  the 
molecule.  Together,  these  make  the  conformational  energy  surfaces  associated  with 
protein  sequences  enormous. 
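
As a rough worked illustration of this scale (a sketch under stated assumptions, not a calculation taken from this work), suppose that each of the two backbone dihedral angles of every residue could adopt only k discrete values, and that side chains are ignored entirely. The number of backbone conformations Ω of a chain of N residues would then be

```latex
\Omega = k^{2N},
\qquad \text{e.g. } k = 3,\ N = 100: \quad
\Omega = 3^{200} \approx 2.7 \times 10^{95}
```

The values k = 3 and N = 100 are chosen only for illustration. Even this coarse discretization gives a search space far too large to enumerate, and side-chain rotations enlarge it further.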


Figure 1-1. Two rotational degrees of freedom in an amino acid, designated by the
dihedral angles φ and ψ, give a peptide chain its flexibility.

Conformational prediction for all molecules has long been difficult. The protein
conformation problem is intricately connected with questions that lie at the heart of
physical chemistry: How do we describe the interaction of two molecules with each
other? How do we describe the interaction of ensembles of molecules? Answers to these
questions for simpler systems have not yet been found, although impressive progress has
been made in this area in the past decade (Levitt 1992, Schiffer et al. 1992, Park and
Levitt 1996, Hao and Scheraga 1996, Fraternali and van Gunsteren 1996). Currently
there  is  no  method,  automated  or  manual,  parameterized  or  ab  initio,  that  precisely 
predicts  the  conformation  of  any  organic  molecule  in  solution.  Conformation  is 
especially  poorly  understood  in  strongly  interacting  solvents  such  as  water,  the 
environment  where  most  globular  proteins  exist  physiologically. 

If  this  were  not  sufficient,  evolutionary  issues  unique  to  biological  molecules  such 
as  proteins  suggest  that  predicting  conformation  should  be  especially  difficult  (Benner 
1989).  Natural  selection  seeks  biomolecules  that  contribute  to  survival,  mate  selection, 
and  reproduction  in  their  host  organism.  A protein  with  high  conformational  stability  is 
rarely  desired  by  natural  selection,  if  only  because  a cell  living  in  a changing 
environment  is  continually  degrading  proteins  to  reuse  their  constituent  amino  acids  to 
make  new  proteins.  Thus,  natural  selection  typically  seeks  a protein  that  unfolds  at  a 
temperature  only  a few  degrees  higher  than  the  physiological  temperature  for  an 
organism  (Benner  1989). 

However, if a protein obeys all of the rules of folding, conformational stability is
possible (Benner 1989). The conformational stability of proteins from thermophiles, the
ease with which point mutation can increase conformational stability, and the insolubility
of a typical peptide (remembering that precipitation, where a peptide interacts with other
peptides rather than with solvent, is a folding process) are evidence for this. Thus,
selective pressures create proteins that are conformationally unstable relative to the
stability that could be achieved if a protein were to exploit all of the stabilizing
interactions  available  to  a typical  polypeptide  chain  (Benner  1989).  This  implies  that 
natural  proteins  violate  folding  rules  to  achieve  a desired  level  of  instability.  This,  in 
turn,  implies  that  even  if  the  chemist  learns  the  rules  that  confer  conformational  stability 
on  small  molecules,  and  can  apply  them  to  large  molecules  such  as  proteins,  natural 
protein  sequences  will  deceive  the  chemist  attempting  to  apply  these  rules  to  predict  their 
conformations. 

Classical  Structure  Prediction 

Discussions  of  conformation  in  proteins  began  immediately  after  the  first  proteins 
were sequenced. A daring attempt by Scheraga (1960) to predict the conformation of
ribonuclease as early as 1960, based on a variety of experimental and theoretical
considerations, is especially noteworthy, if only because it illustrates how difficult the
problem is. Not until the early 1970s, however, did the search for methods to predict
conformation begin in earnest. Work of Anfinsen and colleagues (1961) showed that
denatured proteins, in certain cases, could refold spontaneously, providing experimental
support for the paradigm that the protein sequence alone can determine the conformation
of a protein. This paradigm remains dominant today, despite the discovery of chaperones
(Hartl 1996), evidence that some proteins form metastable structures (Baker and Agard
1994),  and  renewed  interest  in  protein  folding  pathways  (Dodge  et  al.  1994),  all  of  which 
suggest  that  protein  folding  has  a kinetic  as  well  as  a thermodynamic  component. 

There  are  two  general  classes  of  protein  structure  prediction  methods:  theoretical 
methods  that  make  use  of  the  physical  and  chemical  properties  of  the  residues  such  as 
hydropathy,  size,  and  charge  (Biou  et  al.  1988),  and  empirical  methods  which  use  the 
known sequences and structures of proteins (Clark et al. 1991). Empirical methods can
themselves be divided roughly into two types: those based on sequence homology
between  the  protein  under  investigation  and  database  proteins  with  known  structures 
(Biou  et  al.  1988),  and  those  based  on  statistical  analyses  of  proteins  of  known  structure 
to  deduce  the  probability  of  an  amino  acid  appearing  in  a particular  structural  feature 
(Holley  and  Karplus  1989). 

Theoretical  Methods 

Physicochemical  methods  rely  on  physical  and  chemical  principles  to  rationalize 
and  predict  protein  conformation.  For  example,  hydrophobic  side  chains  are  more  likely 
to  be  buried  in  a protein  that  folds  in  water  than  are  hydrophilic  side  chains  (Schulz  and 
Schirmer  1979).  Lim  (1974)  noted  many  years  ago  that  a helix  might  be  identified  in  a 
polypeptide  sequence  from  a characteristic  3.6-residue  periodicity  in  the  placement  of 
hydrophilic  and  hydrophobic  residues.  Such  periodicity  can  be  easily  visualized  by  use 
of  a Schiffer-Edmundson  helical  wheel  (Figure  1-2).  The  hydrophobic  face  of  the 
amphiphilic  helix  is  often  found  to  be  buried  within  the  fold. 
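
The helical wheel projection can be sketched in a few lines. The example below is only an illustration, not code from this work: it assumes an ideal helix with 3.6 residues per turn (100 degrees per residue), and the function name `wheel_angles` is hypothetical.

```python
# Angular step per residue for an ideal helix: 360 degrees / 3.6 residues.
ANGLE_PER_RESIDUE = 100.0


def wheel_angles(sequence):
    """Return (residue, angle in degrees) for each position on the wheel.

    Residue i is placed at i * 100 degrees around the helix axis,
    wrapped into the range [0, 360).
    """
    return [(aa, (i * ANGLE_PER_RESIDUE) % 360.0) for i, aa in enumerate(sequence)]


if __name__ == "__main__":
    for aa, angle in wheel_angles("LKELLKKLLEELLKKLLK"):
        print(f"{aa} {angle:5.1f}")
```

Residues whose angles cluster on one side of the wheel lie on the same face of the helix; after 18 residues (five turns) the pattern repeats exactly.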

The  notion  of  amphiphilicity  has  been  generalized  to  include  hydrophobic  moments 
of  secondary  structural  elements  (Eisenberg  et  al.  1989).  The  hydrophobic  moment  is  an 
analog  of  the  electric  dipole  moment,  except  that  it  measures  the  asymmetry  of  the 
hydrophobicity  in  a structure  rather  than  the  asymmetry  of  the  electrical  charge.  Thus,  a 
helix with hydrophobic residues on one side and hydrophilic residues on the other has a
large  hydrophobic  moment  and  is  expected  to  be  stable  at  (for  example)  an  interface 
between  oil  and  water. 
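
A minimal sketch of a hydrophobic moment calculation follows. It treats each residue as a vector whose length is its hydrophobicity and whose direction is its position on the helical wheel, and reports the magnitude of the vector sum. The Kyte-Doolittle hydropathy scale is swapped in here for convenience and is an assumption, as are the 100-degree step and the function name; Eisenberg and colleagues used their own consensus scale.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed stand-in scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(sequence, delta_deg=100.0, scale=KYTE_DOOLITTLE):
    """Magnitude of the sum of hydrophobicity vectors around the helix axis.

    delta_deg is the angular step between successive residues
    (100 degrees for an ideal alpha helix).
    """
    delta = math.radians(delta_deg)
    sin_sum = sum(scale[aa] * math.sin(i * delta) for i, aa in enumerate(sequence))
    cos_sum = sum(scale[aa] * math.cos(i * delta) for i, aa in enumerate(sequence))
    return math.hypot(sin_sum, cos_sum)
```

A segment with hydrophobic residues on one face and hydrophilic residues on the other gives a large value, whereas a segment of uniform hydrophobicity spanning whole turns gives a value near zero.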

Physicochemical  methods  for  predicting  secondary  structure  have  also  been  the 
subject  of  excellent  reviews  (Fasman  1989).  These  tools  have  shown  promise  when 
applied to single sequences in some cases but not in others. These are discussed in
greater detail below. Further, physicochemical analyses have proven to be important in
many  evolution-based  prediction  tools,  as  they  appear  to  be  more  readily  averageable 
than  statistical  methods. 


Figure  1-2.  Schiffer-Edmundson  helical  wheel  showing  the  position  of  hydrophobic  and 
hydrophilic  amino  acids.  This  particular  relative  orientation  of  the  side  chains 
can  be  used  as  a definition  of  a helix  (Schulz  and  Schirmer  1979). 

In  individual  cases,  failures  of  physicochemical  methods  to  make  correct  secondary 
structure  predictions  can  often  be  related  to  violations  of  folding  rules  by  proteins.  When 
such  violations  are  observed,  they  often  offer  the  biochemist  an  opportunity  to  engineer 
the  protein  to  improve  its  stability.  For  example,  if  a natural  protein  places  a 
hydrophobic  residue  on  its  surface,  a glycine  in  a helix,  or  an  acyclic  amino  acid  at  a 
position  in  a protein  where  a proline  would  fit  the  backbone  configuration  (Alber  1989, 
Matthews  et  al.  1987),  a more  stable  protein  can  often  be  obtained  by  replacing  the 
hydrophobic  residue  by  a hydrophilic  residue,  the  glycine  by  an  alanine,  or  the  flexible 
residue by a proline. In each case, the mutation makes the sequence obey the folding
rules better. Examples where improved stability is engineered into a protein via a single
amino acid substitution offer additional evidence that natural selection does not seek
proteins with maximized stability (Benner 1989). If increased stability were a goal of
natural  selection  and  achievable  by  simple  point  mutation,  evolutionary  processes  would 
have  already  introduced  the  changes  made  by  the  protein  engineer. 

Physicochemical  methods  of  increased  sophistication  use  energy  minimization, 
molecular  dynamics,  or  even  quantum  mechanical  tools.  These  tools  have  been  reviewed 
in  detail  elsewhere  (Bohm  et  al.  1992,  Mackay  et  al.  1989,  McCammon  et  al.  1989). 

Here,  the  limitations  of  the  methods  relate  directly  to  the  complexity  of  the  computations 
involved,  the  difficulties  associated  with  finding  optima  on  an  energy  surface,  and  the 
difficulties  in  obtaining  accurate  models  for  water,  side  chain-solvent  interactions,  and 
side  chain-side  chain  interactions.  Together,  these  have  often  defeated  direct 
computation  of  protein  conformation.  However,  some  interesting  cases  have  produced 
good  conformational  models  (Gibson  et  al.  1988).  Further,  the  increase  in  computational 
power  encouraged  many  groups  to  make  a direct  assault  on  the  de  novo  computation  of 
protein  conformation  (Kolinski  and  Skolnick  1994,  Srinivasan  and  Rose  1995).  Some  of 
these  have  now  been  shown  to  fail  in  specific  cases  in  a bona  fide  prediction  setting 
(Dunbrack  et  al.  1997). 

Empirical Methods

Statistical

Empirical statistical methods of protein structure prediction were among the very
first programs to predict protein structure from amino acid sequence. They
revolutionized the field, and several of them are still in use in their original form. The two
most widely used secondary structure prediction methods are those devised by Chou
and Fasman (1978) and Garnier et al. (1978).

The  Chou-Fasman  method,  developed  in  1974,  was  the  first  algorithm  developed 
for  predicting  the  secondary  structure  of  globular  proteins  (Chou  and  Fasman  1974).  The 
Chou-Fasman  algorithm,  in  addition  to  the  input  sequence,  uses  a table  of  conformational 
propensities  (Chou  and  Fasman  1974).  For  each  amino  acid,  this  table  gives  a value 
describing the given amino acid’s propensity to be found in an α helix or a β strand and
categorizes the amino acids into six groups based on frequency for each of the two
secondary structure elements. A query sequence is then scanned for a region where three
of five amino acids have a high probability of being in a β strand, or four of six amino
acids have a high probability of being in an α helix. From each such region, the
prediction is extended in either direction, one amino acid at a time, until the prediction
value for that type of structure drops below a specific value. The propensity scores were
calculated by measuring the frequencies of each amino acid associated with a given
conformational state in a database consisting of 15 proteins. The frequencies were
normalized by the prevalence of the amino acid in the database. The authors reported 77%
accuracy (Chou and Fasman 1974) based on a test set that was also used to develop the
program, but a later evaluation with a larger test set lowered the accuracy to 50% (Kabsch
and Sander 1983). Despite
the  low  accuracy,  this  program  is  still  widely  used  for  predicting  secondary  structure. 
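
The propensity calculation and the initial scan described above can be sketched as follows. This is a simplified illustration, not the published Chou-Fasman program: the function names, the one-letter state labels, and the threshold of 1.0 for a "high" propensity are assumptions, and the extension step in either direction is omitted.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Propensity of each amino acid for `state` (e.g. "H" for helix).

    Computed as the fraction of an amino acid's occurrences found in
    `state`, divided by the fraction of all residues found in `state`.
    Assumes at least one residue in the data set is in `state`.
    """
    total = Counter()
    in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    overall = sum(in_state.values()) / sum(total.values())
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}


def candidate_regions(sequence, prop, window=6, needed=4, threshold=1.0):
    """Start indices of windows with at least `needed` high-propensity residues.

    The defaults correspond to the four-of-six rule for helices;
    window=5 and needed=3 correspond to the three-of-five rule for strands.
    """
    starts = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        formers = sum(1 for aa in segment if prop.get(aa, 0.0) > threshold)
        if formers >= needed:
            starts.append(start)
    return starts
```

A propensity above 1 means the amino acid occurs in the state more often than expected from its overall abundance, which is why the raw frequencies must be normalized before they are compared.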

The method of Garnier et al. (1978), known as GOR after the initials of its three
authors, uses as its measure the information content, which is roughly the logarithm of
the propensity used by the Chou-Fasman method. Using this measure, GOR computes
the information difference, which is the difference between the likelihood of a
conformational state given the residue and the likelihood of all other conformational
states. The GOR method calculates probability values for a specific amino acid based on
the adjacent amino acids up to eight residues away using principles of information theory.
The use of a fixed window was significant in that it foreshadowed a number of algorithms
that later used windows to generate multi-dimensional input vectors. The accuracy of the
GOR method (Garnier et al. 1978) was cited at 64%, but as with the Chou-Fasman
algorithm, later studies (Nishikawa 1983) reported a lower accuracy, below 55%. Like
the Chou-Fasman method, the GOR method was created at a time of
relatively scarce protein data.
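
In symbols, the quantities described above can be written as follows. The notation is introduced here for illustration and is an assumption: S is a conformational state, \bar{S} denotes all other states, R is a residue type, and j indexes the position being predicted.

```latex
\begin{aligned}
I(S; R) &= \log \frac{P(S \mid R)}{P(S)} \\
I(\Delta S; R) &= I(S; R) - I(\bar{S}; R)
  = \log \frac{P(S \mid R)}{1 - P(S \mid R)} - \log \frac{P(S)}{1 - P(S)} \\
I(\Delta S_j; R_{j-8}, \ldots, R_{j+8}) &\approx \sum_{m=-8}^{8} I(\Delta S_j; R_{j+m})
\end{aligned}
```

The last line expresses the window of eight residues on either side: each neighbor contributes independently to the information about the state at position j, and the state with the largest summed information difference is predicted.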

Sequence  similarity  methods 

Structure  can  be  predicted  by  detecting  a similarity  between  the  sequence  of  the 
protein  under  investigation  and  the  sequence  of  a protein  whose  structure  is  already 
known.  These  methods  use  the  assumption  that  sequence  similarity  leads  to  structural 
similarity  and  in  general  have  an  accuracy  of  65%  (Reimer  and  Fuellen  2003). 

If  a homolog  of  the  sequence  under  investigation  is  found  in  a protein  database  it  is 
relatively  easy  to  then  obtain  a 3D  picture  of  the  unknown  protein  by  homology  modeling 
(Segovia 1997). However, sequence homology is not always useful: in some cases no structures
are known for related proteins, or no related proteins are found. In addition, the extent of
differences  in  structure  between  two  proteins  is  related  to  the  extent  of  the  amino  acid 
sequence  divergence  (Chothia  and  Lesk  1986),  which  could  affect  the  accuracy  of 
predictions. 

Levin  et  al.  (1986)  report  an  algorithm  that  predicts  secondary  structure  using  short 
sections  of  sequence  similarity,  even  if  the  whole  protein  is  not  homologous.  A similarity 
matrix is used to assign a sequence similarity score between two sequences. If the score
is above a certain threshold then the sections are considered homologous and thus their
structures  are  considered  homologous.  Structure  prediction  by  sequence  homology  was 
improved  further  by  Levin  et  al.  (1993)  by  the  use  of  multiple  sequence  alignments.  This 
provides  extra  information  about  sequence  to  structure  relationships  by  using  the 
alignment  of  the  sequences  of  a family  of  homologous  proteins  to  understand  which 
sections  of  sequence  are  important  for  a particular  structural  feature.  Levin  et  al.  (1993) 
determined  that  secondary  structure  predictions  can  be  reliably  improved  using 
alignments  from  an  automatic  alignment  procedure  with  a mean  increase  of  6.8%,  giving 
an  overall  prediction  accuracy  of  68.5%,  if  there  is  a minimum  of  25%  sequence  identity 
between  all  sequences  in  a family. 

Evolution-Based Structure Prediction

The fact that natural proteins are the products of divergent evolution creates
opportunities as well as problems when developing tools for predicting conformation
from sequence (Levitt 1992, Benner 1989, Pascarella and Argos 1994, Wako and
Blundell 1994a, Wako and Blundell 1994b, Rost and Sander 1994, Taylor 1993, Zvelebil
et al. 1987). Proteins in the modern world almost never come alone. Rather, nature
presents sets of homologous proteins (proteins related by common ancestry) performing
analogous functions in different organisms. As long as their genes have continuously
performed a function since they divergently evolved, homologous proteins retain their
overall conformation. Indeed, this conformation can be retained long after sequence
similarity has been lost in statistical noise (Chothia and Lesk 1986, Rossmann et al. 1975).
This is quite different from the conformational behavior of a homologous series of
compounds in organic chemistry, a set of compounds differing in the length of a chain,
where conformation between members need have no similarities. Natural selection
acting on homologous proteins divergently evolving under functional constraints is the
reason for this difference.

For  this  reason,  a set  of  sequences  of  proteins  within  a family  of  homologous 
proteins  contains  more  information  about  conformation  than  a single  sequence  or  a single 
member  of  the  family  (Benner  1989,  Zvelebil  et  al.  1987,  Sternberg  and  Cohen  1982, 
Maxfield and Scheraga 1979, Lenstra et al. 1977, Crawford et al. 1987, Bowie et al.
1991, Shortle 1995). The set of protein sequences is a set of different molecular
structures  that  achieve  (more  or  less)  the  same  conformation. 

Understanding  the  Details  of  Molecular  Evolution 

Alignment 

To  have  a transparent  view  of  evolutionary  analysis  as  a tool  for  making  secondary 
structure  predictions,  we  must  begin  by  understanding  the  key  element  of  evolutionary 
analysis:  sequence  alignment  (Molecular  Evolution,  1990).  An  alignment  attempts  to 
represent  the  evolutionary  relationship  between  two  protein  sequences  by  placing  them 
side-by-side.  In  this  way  an  alignment  shows  which  amino  acid  substitutions  have  been 
accepted  since  the  two  proteins  diverged  from  their  common  ancestor.  These 
substitutions  are  not  random  if  the  descendent  proteins  have  served  functions  in  the 
descendent  organisms  (that  is,  assuming  that  the  proteins  have  diverged  under  functional 
constraints).  Most  proteins  have  a function  that  contributes  to  the  ability  of  their  host 
organism  to  survive,  select  a mate,  and  reproduce.  To  perform  this  function,  proteins 
adopt  a fold  (or  tertiary  structure),  a structure  that  is  conserved  much  more  highly  than 
the  sequence  itself. 

Function  therefore  constrains  what  amino  acid  substitutions  are  accepted  during 
divergent evolution; some substitutions are never observed because they are lethal to the
host organisms. Other substitutions help the protein perform its selective function
(positive  or  adaptive  substitutions)  and  will  be  incorporated  at  a high  rate,  especially 
when  a new  function  is  emerging.  Still  other  substitutions  represent  neutral  drift  in  the 
structure  (King  and  Jukes  1969,  Kimura  1982)  having  no  selectable  impact  on  the  fitness 
of  the  protein. 

The basic element of the score is the probability that the proteins whose sequences are
being aligned are in fact related by common ancestry. This score is often expressed as
the logarithm of the probability that the similarities in the two sequences seen in the
alignment  arose  by  reason  of  common  ancestry,  divided  by  the  probability  that  these 
similarities  arose  by  random  chance.  This  probability  is  generally  obtained  by  comparing 
the  aligned  sequences  one  position  at  a time.  Under  this  procedure,  a score  is  first  given 
to  each  pair  of  amino  acids  matched  in  the  alignment  (Dayhoff  et  al.  1978). 

Substitution  Matrices 

We  can  derive  a mutation  matrix  showing  the  probability  of  amino-acid 
substitution  in  proteins  undergoing  divergent  evolution  subject  to  functional  constraints 
(that  is,  under  circumstances  where  the  descendent  proteins  must  fold  and  perform 
function). Suppose that we extracted from the database a sample of aligned sequence
pairs where homology was indisputable. In this example, all of the sequence pairs are
99% identical; that is, each pair has undergone only 1 accepted point mutation per 100
amino acids. A matrix is then constructed, recording the occurrence of all of the 210
possible pairings of amino acids in the alignments (20 pairings of an amino acid with
itself plus 190 pairings of two different amino acids). This would provide an empirical
statement regarding the pattern of amino-acid substitution in this set of homologous
proteins.
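
The tabulation step can be illustrated with a minimal sketch. It assumes that each aligned pair is supplied as two strings of equal length and that any character outside the 20 standard one-letter codes (a gap, for example) is skipped; the function name and input format are hypothetical and are not taken from the programs used in this work.

```python
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Every unordered pairing of the 20 amino acids, including self-pairings.
ALL_PAIRINGS = list(combinations_with_replacement(sorted(AMINO_ACIDS), 2))


def count_pairings(aligned_pairs):
    """Count how often each unordered amino acid pairing is matched.

    aligned_pairs: iterable of (row_a, row_b) strings of equal length.
    Columns containing a non-standard character (assumed gap) are ignored.
    """
    counts = Counter()
    for row_a, row_b in aligned_pairs:
        for x, y in zip(row_a, row_b):
            if x in AMINO_ACIDS and y in AMINO_ACIDS:
                counts[tuple(sorted((x, y)))] += 1
    return counts
```

The list ALL_PAIRINGS has 210 entries, which is the number of matrix elements described above.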

The matrix must then be normalized. It is not sufficient to tabulate the probability
that a pair of amino acids is found matched in an alignment of two homologous
sequences at a specific evolutionary distance. Since amino acids generally occur with
different frequencies in proteins, the probability of each matching arising by random
chance is not the same. Thus, the probabilities must be normalized by dividing them by
the probability of the matching occurring by chance.

The result of this work was a Dayhoff Matrix (Figure 1-3) that can be used to score
and align protein sequences (Dayhoff et al. 1978). Elements of the scoring matrix are
usually the logarithms of the normalized probabilities multiplied by ten. Thus, the matrix
elements are ten times the log of (the probability that two amino acids will be matched by
homology, divided by the probability that they would be matched by chance).
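
As a worked illustration of this normalization, the following sketch computes a single matrix element. It assumes that the pair probability refers to an ordered pair (amino acid x in the first sequence, y in the second), so that the chance probability is simply the product of the two background frequencies, and that the logarithm is taken to base 10. The function name and the numbers in the usage note are hypothetical.

```python
import math


def log_odds_score(pair_prob, freq_x, freq_y, scale=10.0):
    """Return scale * log10(P(x, y matched by homology) / P(x, y matched by chance)).

    pair_prob: probability of seeing x aligned with y in homologous sequences.
    freq_x, freq_y: background frequencies of the two amino acids.
    """
    chance_prob = freq_x * freq_y
    return scale * math.log10(pair_prob / chance_prob)
```

For example, a pairing observed with probability 0.01 between amino acids whose background frequencies are 0.10 and 0.05 is twice as likely as chance, and log_odds_score(0.01, 0.10, 0.05) is about 3.0. A pairing observed exactly as often as chance predicts scores zero, and a pairing that is avoided scores below zero.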


    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Figure  1-3.  The  log  odds  matrix  for  PAM250  (multiplied  by  10). 

It is possible to align two protein sequences that are less than 99% identical.
However, the scoring matrix must be different. In particular, the diagonal terms of the
matrix (representing conserved amino acids) become smaller as two protein sequences
diverge, while the off-diagonal terms of the matrix become larger. Thus, the matrix used
in scoring an alignment must reflect the evolutionary distance between the two sequences
being aligned.
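
The trend in the diagonal and off-diagonal terms can be illustrated numerically. The sketch below assumes a Markov model in which the mutation probability matrix for a distance of n PAM units is the 1-PAM matrix raised to the nth power; the three-letter alphabet and the numbers in TOY_PAM1 are hypothetical and were chosen only so that each row sums to one and about 99% of positions are conserved per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = matrix
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1-step mutation probabilities for a toy three-letter alphabet.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

In toy_pam250 the diagonal probabilities are smaller, and the off-diagonal probabilities larger, than in TOY_PAM1, which is the behavior described above for real matrices at increasing evolutionary distance.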

Gaps  in  an  Alignment 

During  divergent  evolution,  portions  of  genes  may  be  added  (inserted)  or  removed 
(deleted).  This  results  in  homologous  proteins  that  contain  different  numbers  of  amino 
acids.  This  implies,  in  turn,  that  an  alignment  of  sequences  within  a family  of  proteins 
where  insertions  and  deletions  (indels)  have  taken  place  will  have  unmatched  amino 
acids,  which  form  gaps  in  the  alignment.  In  an  alignment  of  just  two  homologous 
sequences,  it  is  impossible  to  tell  whether  the  gap  arose  from  an  insertion  event  in  the 
lineage  leading  to  the  protein  with  additional  amino  acids  (implying  that  the  ancestral 
protein had fewer amino acids); or whether the gap arose from a deletion event that
removed  amino  acids  from  the  ancestral  sequence  in  the  lineage  leading  to  the  protein 
with  fewer  amino  acids. 

The  placement  of  gaps  is  a critical  step  when  constructing  an  alignment. 
Considerable  research  has  been  devoted  to  understanding  how  gaps  should  be  placed 
(Benner  et  al.  1993,  Pascarella  and  Argos  1992).  In  practice,  one  does  not  know  which 
amino  acids  have  been  inserted  or  deleted.  Gaps  are  placed  to  optimize  a score 
associated  with  an  alignment.  But  if  gaps  are  introduced  without  limit,  even  two  random 
sequences  can  be  aligned  to  give  a perfect  score.  Therefore,  gaps  must  be  penalized  to 
enforce their judicious use. The most common scheme for penalizing gaps charges a
price for introducing a gap, and an incremental price for each additional amino acid that
is added to the gap.
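
A minimal sketch of this penalty scheme follows. The opening and extension costs (10 and 1) are arbitrary illustrative values, the gap character "-" is an assumption, and the substitution score is passed in as a function so that any scoring matrix could be used; none of these choices is taken from the alignment programs used in this work.

```python
def gap_penalty(length, open_cost=10.0, extend_cost=1.0):
    """Cost of one gap spanning `length` residues (zero for an empty gap)."""
    if length <= 0:
        return 0.0
    return open_cost + extend_cost * (length - 1)


def score_alignment(row_a, row_b, substitution, open_cost=10.0, extend_cost=1.0):
    """Score a finished pairwise alignment: matched pairs minus gap penalties."""
    total = 0.0
    run_length = 0    # length of the gap currently being extended
    run_side = None   # row holding that gap: "a", "b", or None
    for x, y in zip(row_a, row_b):
        side = "a" if x == "-" else ("b" if y == "-" else None)
        if side is not None and side == run_side:
            run_length += 1
            continue
        # The previous gap (if any) has ended, so charge for it.
        total -= gap_penalty(run_length, open_cost, extend_cost)
        run_length = 1 if side is not None else 0
        run_side = side
        if side is None:
            total += substitution(x, y)
    total -= gap_penalty(run_length, open_cost, extend_cost)
    return total
```

With a toy substitution function that returns 2 for identical residues and -1 otherwise, the alignment of AC--G with ACTTG scores 2 + 2 + 2 for the three matches minus 11 for the single two-residue gap.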




Parsing  Strings 

Much  of  the  success  of  transparent  tools  for  predicting  helices  and  strands  arises 
from  tools  that  predict  regions  that  are  not  helices  or  strands.  Parsing  tools  divide  a 
protein  sequence  into  segments  that  form  standard  secondary  structure  independently.  By 
parsing  a sequence,  secondary  structure  predictions  need  consider  at  any  one  time  only 
short  segments  of  the  polypeptide  chain,  which  is  intrinsically  easier  than  considering  the 
polypeptide  chain  as  a whole.  Thus,  understanding  the  evolution  of  loops  is  an  important 
step  toward  developing  tools  for  predicting  secondary  structure  in  proteins.  This  is  the 
central  theme  of  the  work  presented  in  Chapter  4. 

Neutral  versus  Adaptive  Variation 

Two  types  of  variation  occur  as  protein  sequences  divergently  evolve.  Neutral 
variation  involves  substitutions  that  do  not  influence  the  ability  of  an  organism  to  survive 
and  reproduce  (King  and  Jukes  1969,  Kimura  1982).  These  are  variations  that  have  little 
impact  on  behavior  in  a protein.  From  a structural  viewpoint,  such  variations  should  lie 
predominantly  on  the  surface  of  the  folded  structure.  Thus,  neutral  variation  is  sought 
when  attempting  to  identify  surface  positions  by  seeking  variation  in  an  alignment. 

Adaptive  substitutions  accumulate  as  well  during  divergent  evolution,  however. 
Adaptive  substitutions  alter  the  behavior  of  the  protein,  often  to  make  it  better  suited  for  a 
new  environment  or  a new  function.  Mutations  that  alter  function  or  create  new  function 
are  the  opposite,  structurally,  of  mutations  that  do  not  influence  function,  and  adaptive 
variation  need  not  lie  on  the  surface  of  a protein.  Indeed,  it  may  lie  near  an  active  site,  a 
regulatory  site,  or  inside  the  folded  structure  of  a protein  (Benner  1989). 

Unfortunately, neutral and adaptive variation appear the same at first inspection of
a multiple alignment. To use variation to identify surface positions, therefore, heuristics
must be developed that separate (as much as possible) adaptive variation from neutral
variation. No filter is known that reliably distinguishes between neutral and adaptive
variation, as a rich literature in the field shows (Kimura 1982). However, a filter built on
the notion of “concurrent variation” has proven to be rather effective for the purpose of
structure prediction (Benner et al. 1994). To apply this filter, positions are identified in a
multiple alignment where variation is observed simultaneously in different sub-branches
of the evolutionary tree. A position is assigned to the surface of the folded structure only
if it is variable in more than one sub-branch of an evolutionary tree relating the
sequences.
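
The logic of the filter can be sketched as follows. This is an illustration only, not the implementation of Benner et al. (1994): it assumes that the alignment is a mapping from sequence name to aligned string, that the sub-branches are supplied as lists of sequence names, that a position is "variable" within a sub-branch when its members show more than one distinct residue, and that the characters "-" and "_" mark gaps and are ignored.

```python
def concurrent_variation_sites(alignment, subbranches, min_branches=2):
    """Return alignment columns that vary within at least `min_branches` sub-branches.

    alignment: dict mapping sequence name to aligned sequence (equal lengths).
    subbranches: list of lists of sequence names, one list per sub-branch.
    """
    length = len(next(iter(alignment.values())))
    surface_columns = []
    for col in range(length):
        variable_branches = 0
        for branch in subbranches:
            residues = {alignment[name][col] for name in branch}
            residues -= {"-", "_"}          # assumed gap characters
            if len(residues) > 1:
                variable_branches += 1
        if variable_branches >= min_branches:
            surface_columns.append(col)
    return surface_columns
```

Note that a column that differs between two sub-branches but is conserved within each of them is not reported, since such a pattern is more suggestive of a single adaptive change than of neutral drift.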

Models  of  Protein  Sequence  Divergence 

The  evolution  of  protein  sequences  is  nearly  always  described  using  one  of  several 
stochastic  models  for  the  accumulation  of  amino  acid  replacements  (Benner  et  al.  1997). 
These  are  captured  in  algorithms  known  by  widely  recognized  names  (e.g.,  the 
Needleman-Wunsch (1970) and Smith-Waterman (1981) maximum likelihood tools).
These  tools  have  become  more  sophisticated  in  recent  years,  as  mathematicians  have 
managed  to  improve  these  types  of  models. 

Nevertheless,  patterns  of  replacement  predicted  by  these  mathematical  models 
remain  quite  different  from  the  patterns  that  are  actually  observed  in  proteins  diverging 
under  functional  constraints  (Benner  et  al.  1997).  The  reason  for  these  differences  is  well 
understood.  Briefly,  simple  stochastic  methods  treat  proteins  as  if  they  were  linear 
strings  of  letters.  In  reality,  proteins  have  three-dimensional  structures  that  support 
behaviors  that  are  important  for  them  to  contribute  to  the  fitness  of  the  host  (function). 
These  behaviors  are  not  a linear  sum  of  the  behaviors  of  their  parts.  Amino  acid 
replacement  is  therefore  constrained  in  a way  unanticipated  for  a linear  string  of  letters. 




For  structure  prediction,  the  conservation  of  conformation  after  substantial 
sequence  divergence  has  an  important  corollary:  if  one  knows  the  conformation  of  one 
member  of  a protein  family,  one  knows  (more  or  less)  the  conformation  of  all  other 
members  of  the  family.  This  corollary  generated  the  field  of  homology  modeling.  In  this 
field,  the  conformation  of  a target  sequence  is  modeled  by  extrapolating  an  experimental 
conformation  of  a homolog  with  known  structure.  This  field  has  also  created  the 
incentive  to  develop  methods  for  detecting  very  distant  homologs  of  proteins,  as  these  are 
the  starting  points  for  homology  modeling. 

Homology  modeling  is  one  type  of  approach  that  uses  evolutionary  analyses  to 
predict  protein  conformation.  The  second  type  of  approach,  often  referred  to  as  ab  initio 
structure prediction, seeks the structure of a family of proteins where no member of the
family  has  a known  experimental  conformation.  Ab  initio  prediction  is  our  primary 
focus. 

This  dissertation  concerns  the  development  and  assessment  of  evolution-based 
structure  prediction  tools,  especially  as  they  apply  to  secondary  structure  prediction  using 
a transparent  analysis  of  multiple  sequence  alignments.  For  this  purpose,  I have 
constructed  a database  (Chapter  2).  From  this  database,  I extracted  secondary  structural 
information  and  analyzed  internal  helices,  which  are  especially  refractory  to  prediction 
using  this  approach  (Chapter  3).  In  addition,  this  database  was  consulted  to  analyze 
parsing  elements  that  break  a protein  sequence  into  secondary  structural  elements 
(Chapter 4). Finally, evolution-based strategies were applied to a community-wide
experiment on the critical assessment of techniques for protein structure prediction
(Chapter 5).


CHAPTER  2 

CREATION  OF  THE  DATABASE 

Introduction 

The  first  step  in  evolution-based  prediction  of  protein  conformation  involves  the 
assembly  of  a set  of  proteins  related  by  common  ancestry,  and  the  alignment  of  their 
sequences.  Each  process  needs  a model  for  protein  sequence  evolution,  and  requires,  in 
principle,  two  steps.  The  first  requires  that  a series  of  pairs  of  proteins  be  examined  with 
the  intention  of  deciding  whether  or  not  they  are  homologs.  This  is  a “yes-no”  decision 
(two  proteins  are  related  by  common  ancestry,  or  they  are  not).  In  practice,  however, 
there  is  a gray  area,  where  sequence  similarity  set  in  the  context  of  a model  for  evolution 
is  inadequate  to  decide  whether  the  two  proteins  are  homologous  or  not;  in  these  cases, 
we  chose  “not”,  as  incorporating  the  distant  homolog  sequence  will  most  likely  create  an 
unsatisfactory  alignment. 

The  second  step  begins  with  pairs  of  protein  sequences  that  have  been  assigned  as 
homologs.  Here,  we  must  determine,  through  alignment,  the  evolutionary  relationship 
between  the  individual  amino  acids  in  the  two  sequences.  Aligning  amino  acid  i in 
protein  I with  amino  acid  j in  protein  J is  equivalent  to  making  the  statement  that  the 
codon  that  encodes  i in  I and  the  codon  that  encodes  j in  J are  descendents  of  a single 
codon  in  the  gene  in  the  last  common  ancestor  of  I and  J. 

Construction  of  the  Database 

To  study  protein  sequence  divergence  relative  to  protein  folding,  one  must 
construct  a database  containing  protein  sequences  and  secondary  structural  information. 




To collect sequences that were suitable for this study, we exploited a collection of
precomputed, modularized, and aligned sets of protein sequences. Each such set is defined
as a family in the MasterCatalog (MC). These families were built from sequences extracted
from  GenBank  (version  129).  The  MasterCatalog  (EraGen  Biosciences,  Madison, 
Wisconsin),  developed  in  collaboration  with  EraGen,  organizes  all  of  these  according  to 
their  evolutionary  histories.  All  of  the  families  in  the  MC  that  were  associated  with 
experimental  secondary  structures  were  identified.  Families  were  retained  if  they  had 
only  one  experimental  secondary  structure. 

Distinct  sequences  (probe  sequences)  that  are  associated  with  only  one  crystal 
structure  were  accepted.  Then  all  of  the  MC  families  containing  elements  from  these 
sequences  were  collected.  Next,  all  of  the  members  of  these  families  were  recorded.  It 
was  then  necessary  to  loop  through  these  families  to  eliminate  those  sequences  that  are 
built  in  part  from  modules  containing  more  families  than  the  probe  sequence.  This 
assured  a distinct  list  of  proteins,  each  of  which  has  one  sequence  and  one  Protein  Data 
Bank  (PDB)  set  of  coordinates.  The  output  is  a set  of  protein  sequences,  wherein  each 
member  of  the  set  is  associated  with  a single  PDB  entry  and  where  no  two  sequences 
contain  an  element  from  the  same  family.  These  were  retrieved  to  determine  the  number 
of  sequences  in  each  family  and  the  evolutionary  divergence  separating  those  sequences. 
This was accomplished by running the program GetSeqIds.pl, shown in Figure A-1 of
Appendix A.

We retrieved 1027 sequences by this method. The MC modules did not provide
full-length sequences. The collection of sequences was culled to remove one member of
each pair of sequences (as long as it was not the probe sequence) in which the two
sequences were less than or equal to 1 PAM distance unit from each other. In addition, all
sequences that were greater than or equal to 125 PAM distance units from the probe
sequence were discarded. This resulted in 666 sequences. The gi numbers of these
sequences (a gi number is a series of digits assigned consecutively to each sequence
record processed by the National Center for Biotechnology Information (NCBI)) are
listed in Table A-1 in Appendix A.
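
The two culling rules can be sketched in a few lines. This sketch is not the Perl program referred to in this chapter; the function names, the form of the distance table, and the choice of which member of a near-identical pair to drop when neither is the probe (here, the one listed later) are all assumptions made for illustration.

```python
from itertools import combinations


def make_pam_lookup(table):
    """Build a symmetric distance function from a dict keyed by (name, name)."""
    symmetric = {frozenset(pair): dist for pair, dist in table.items()}
    return lambda x, y: 0.0 if x == y else symmetric[frozenset((x, y))]


def cull_family(probe, members, pam, near=1.0, far=125.0):
    """Apply the two culling rules to one family.

    Members at `far` PAM units or more from the probe are discarded, and one
    member of every pair separated by `near` PAM units or less is removed,
    never the probe itself.
    """
    kept = [m for m in members if m == probe or pam(probe, m) < far]
    removed = set()
    for a, b in combinations(kept, 2):
        if a in removed or b in removed:
            continue
        if pam(a, b) <= near:
            removed.add(a if b == probe else b)
    return [m for m in kept if m not in removed]
```

For a hypothetical family with probe p and members a, b, c, and d, where a lies 0.5 PAM units from p, c lies 130 units from p, and d lies 1.0 unit from b, the function keeps only p and b.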

A program  was  written  to  obtain  the  full-length  sequences,  spanning  all  of  the 
families  of  the  full-length  sequences,  and  to  cull  the  sequences  according  to  the 
requirements  identified  above.  Full-length  sequences  ensured  that  sufficient  sequence 
variability  was  present  to  enable  simple  heuristics  to  make  reasonable  predictions  of 
surface  accessibility.  In  addition,  the  output  of  this  program  prepared  an  input  file  for  the 
Darwin  server.  This  program,  PrepareSeqIds.pl,  is  shown  in  Figure  A-2  of  Appendix  A. 

Full-length  sequences  were  submitted  to  the  Darwin  server  in  order  to  generate 
multiple  sequence  alignments.  Darwin  stands  for  Data  Analysis  and  Retrieval  with 
Indexed  Nucleotide/peptide  sequences.  Darwin  is  an  interactive  tool  for  peptide  and 
nucleotide  sequence  analysis.  Darwin  is  a programming  environment  with  its  own, 
modern language that has a growing library of functions for sequence management and
analysis,  statistics,  graphics,  and  more  (Gonnet  et  al.  2000).  An  example  of  a Darwin 
program,  darwin.ma,  is  shown  in  Figure  A-3  of  Appendix  A.  This  program  generated  the 
multiple  sequence  alignments  used  in  this  study. 

Darwin was chosen as a tool to calculate multiple sequence alignments because
a number of its functions perform better than those of other systems. Darwin
computes a multiple sequence alignment together with probable ancestral sequences. A
probable ancestral sequence is a probability profile of the possible amino acids in each
position of the root of the tree. The multiple sequence alignment derived in this way is
usually much better than the ones produced by other methods. Also, the Dayhoff matrix
computed by Dayhoff and coworkers (1978) was based on an insufficient number of
matched amino acid pairs to sustain an analysis of substitution rates any more
sophisticated than that implied by the Markov model. Today, it is a relatively easy (and
computationally feasible) task to gather on the order of millions of amino acid pairings.
One would typically need to perform a “self-matching” of only one entire database (such
as Swiss-Prot) to gather a sufficient amount of data. As a result, an improved Dayhoff
matrix is employed in the current Darwin system.

Fraction  of  surface  accessibility  was  then  assigned  for  each  residue  in  the 
representative  protein  using  the  Database  of  Secondary  Structure  in  Proteins  (DSSP) 
(Kabsch  and  Sander  1983).  The  PDB  database  contains  output  generated  by  applying  the 
DSSP  program  to  the  coordinates  for  the  crystal  structure.  From  this  output  I extracted 
the  PDB  sequence,  secondary  structure  for  segments  of  the  sequence,  and  normalized 
surface  accessibility  for  each  residue  in  the  sequence.  The  DSSP  mirrored  database  was 
consulted  to  extract  the  appropriate  PDB  sequence  corresponding  to  the  probe  sequence. 
The  PDB  sequence  rarely  begins  or  ends  with  the  correct  amino  acids,  with  respect  to  the 
probe  sequence,  and  often  misses  a chunk  of  amino  acids  within  the  sequence  itself.  As  a 
result,  a dynamic  programming  algorithm  was  written  in  order  to  semi-globally  align  the 
multiple  sequence  alignment  to  the  PDB  sequence  along  with  secondary  structure 
information,  surface  accessibility  values,  strong  and  weak  surface  assignments,  and 
strong and weak interior assignments. This program is shown in Figure A-4 in Appendix A.
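
The idea behind a semi-global alignment can be illustrated with a minimal scoring sketch. It is not the program of Figure A-4: it aligns two plain strings rather than a multiple alignment against a structure record, it returns only the optimal score, and the match, mismatch, and linear gap values are arbitrary assumptions. What it does show is the defining feature of the method, namely that overhangs at either end of either sequence cost nothing, while gaps inside the aligned region are penalized.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b when end gaps are free.

    Internal gaps cost `gap` per residue; leading and trailing overhangs
    in either sequence are not penalized.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # The first row and first column stay at zero: leading overhangs are free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    # Trailing overhangs are free: take the best cell on the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))
```

For example, semiglobal_score("MKFIKRRHNPWMVHG", "HNPWMVHG") returns 8: the shorter sequence matches the end of the longer one exactly, and the seven unmatched leading residues are not penalized.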




Since  this  information  was  required  frequently  in  our  study,  we  constructed  a 
database  to  capture  these  surface  accessibility  values  for  all  of  the  protein  families.  A 
program  was  written  to  display,  in  a consistent  manner,  all  of  the  features  of  homologous 
protein  sequences  and  secondary  structural  information.  This  program  was  named 
Master.pl and is shown in Figure A-5 of Appendix A. This program incorporates Perl
modules  that  were  written  to  extract  information  from  the  Darwin  output,  the  DSSP 
output,  and  the  MySQL  database  from  which  the  MC  families  were  located.  The 
database  has  the  following  form. 

1. The Multiple Sequence Alignment from the MasterCatalog family.

2. An aligned Protein Data Bank crystal structure.

3. An aligned secondary structure.

4. A set of numbers, the surface accessibility parameter, for each site in the Multiple
Sequence Alignment where the representative protein has a residue. This was
accomplished by writing a single digit below the Multiple Sequence Alignment,
where that number is 10 times the fraction of surface accessibility that is measured
for the side chain, with the fraction rounded down to the nearest tenth. When the
fraction of side chain exposure was greater than 1.0, the number 9 was recorded.
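
The encoding of the surface accessibility parameter in item 4 can be written out as a short sketch. The function names are hypothetical, and two details are assumptions: a fraction of exactly 1.0 is treated like larger values and recorded as 9, and alignment sites where the representative protein has no residue are written as a blank.

```python
import math


def accessibility_digit(fraction):
    """Encode a fractional side chain accessibility as a single digit, 0 to 9."""
    return min(9, max(0, math.floor(10 * fraction)))


def accessibility_line(fractions):
    """Build the digit line written below the alignment.

    fractions: one value per alignment site, or None where the
    representative protein has no residue.
    """
    return "".join(
        " " if f is None else str(accessibility_digit(f)) for f in fractions
    )
```

For example, accessibility_line([0.53, 0.35, None, 1.2]) yields the four-character string "53 9".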

An example of the form of this resource is shown in Figure 2-1. The header
information provides the MasterCatalog sequence identifier (SeqId), the NCBI identifier
(gi number), the PDB code, and the PAM distance between the two most distant sequences
in the family (MaxPAM). The sequences themselves are prefaced by a letter, which is an
identifier for the alignment program. The footer information contains four lines. These
are prefaced by *, $, !, and @. These symbols refer to the crystal structure, the secondary
structure, the surface accessibility parameter, and a prediction of surface and interior
sites, respectively.




Multiple Sequence Alignment: gi|2415721
SeqId: 1069
PDB: 1i6w
MaxPAM: 143.0

1 ..74

c - 

e - MKFIKRRIIALVTILVLSVTSLFAMQPSAKAAEHNPWMVHG_IGGASYNFAGIKSYLVSQGWSRGKLYAVDFW 
j - MKFVKRRIIALVTILMLSVTSLFALQPSAKAAEHNPWMVHG_IGGASFNFAGIKSYLVSQGWSRDKLYAVDFW 
g - MKFVKRRIIALVTILMLSVTSLFALQPSAKAAEHNPWMVHG_IGGAPFNFAGTKSYPVSQGWSRDKLYAVDFW

i - HNPWLVHG_ISGASYNFFAIKNYLISQGWQSNKLYAIDFY

d - HNPWLVHG_ISGASYNFFAIKNYLISQGWQSNKLYAIDFY

h - HNPWMVHG_IGGASYNFFSIKSYLATQGWDRNQLYAIDFI

b - HNPWMVHG_IGGASYNFASIKSYLVGQGWDRNQLFAIDFI

a - HNPWMVHG_MGGASYNFASIKRYLVSQGWDQNQLFAIDFI

f - PWLVHGTFGNRGYTWNTAVPLLRRHG HRVFRLD_Y 


* HNPWMVHG-IGGASFNFAGIKSYLVSQGWSRDKLYAVDFW 

$ - CCCEEEECC  CCCCHHHHHHHHHHHHHCCCCHHHEEECCCC 

! - 530000000  4712172043015103836055730210505 

@ - silllllll  islilsiliillsIIiSisIsssilillsIs 

75  ..148 

c - LRDFVEAVRGATGAAKVDIVGHSQGGMLPRYYVKFLGGADKVDDLV

e - DKTGTNY .NNGPVLSRFVQKVLDETGAKKVDIVAHSMGGANTLYYIKNLDGGNKIENWTLGGANRL 

j - DKTGTNY NNGPVLSRFVQKVLDETGAKKVDIVAHSMGGANTLYYXKNLDGGNKVANWTLGGANRL 

g - DKTGTNY NNGPVLSRFVQKVLDETGAKKVDIVARSMGGANTLYYIKNLDGGNKVANWTLGGANRL 

i - DKTGNNL NNGPQLASYVDRVLKETGAKKVDIVAHSMGGANTLYYIKYLGGGNKIQNWTLGGANGL

d - DKTGNNL NNGPQLASYVDRVLKETGAKKVDIVAHSMGGANTLYYIKYLGGGNKIQNWTLGGANGL

h - DKTGNNR NNGPRLSRFVKDVLDKTGAKKVDIVAHSMGGANTLYYIKNLDGGDKIENWTIGGANGL

b - DKTGNNR NNGPRLSRFVKDVLDKTGAKKVDIVAHSMGGANTLYYIKNLDGGDKIENVIPIGGANGL

a - DKTGNNL NNGPRLSRFVKDVLAKTGAKKVDIVAHSMGGANTLYYIKNLDGGDKIENWTLGGANGL

f - GQHGNPLIFGLGDIKHSARQLADFVDEVLRRTGAQQVDLVGFSQGGMMPRYYLNALGGGPKVHNFVGISPSNHG 

* ★ ★ *★*  ★★★  ★★  * * * 


* - DKTGTNY NNGPVLSRFVQKVLDETGAKKVDIVAHSMGGANTLYYIKNLDGGNKVANWTLGGANRL 

$ - CCCCCHH  HHHHHHHHHHHHHHHHHCCCCEEEEEECHHHHHHHHHHHHCCHHHCEEEEEEECCCHHH 

! - 1660327  30021016203500763827000000011000000200553300410010000000026 

@ - Issliis  illilllsilisIIssiSisIIIIIIIIIIIIIIIillssiillillllllllllllis

149  ..222 

c - 

e - TTSKALPGTDPNQKILYTSIYSSADMIVMNYLSKLDGAKNVQIHGVGHIGLLMNSQVNSLIKEGLNGGGLNTN 
j - TTGKALPGTDPNQKILYTSIYSSADMIVMNYLSRLDGARNVQIHGVGHIGLLYSSQVNSLIKEGLNGGGQNTN 
g - TTGKALPGTDPNQKILYTSIYSSADMIVMNYLSRLDGARNVQIHGVGHIGLLYSSQVNSLIKEGLNGGGQNTN 
i - VSSTALPGTDPNQKILYTSIYSLNDQIVINSLSRLQGARNIQLYGIGHIGLLSNSQVYGYIKKGLNGGGLNTN 
d - VSSTALPGTDPNQKILYTSIYSLNDQIVINSLSRLQGARNIQLYGIGHIGLLSNSQVNGYIKEGLNGGGLNTN 
h - VSSRALPGTDPNQKILYTSVYSSADLIWNSLSRLIGARNILIHGVGHIGLLTSSQVKGYIKEGLNGGGQNTN 
b - VSSRALPGTDPNQKILYTSVYSSADLIWNSLSRLIGARNVLIHGVGHIGLLTSSQVKGYIKEGLNGGGQNTN 
a - VSLRALPGTDPNQKILYTSVYSSADLIWNSLSRLIGARNVLIHGVGHIGLLTSSQVKGYVKEGLNGGGQNTN 
f - VTAQGL 


* - TTGKALPGTDPNQKILYTSIYSSADMIVMNYLSRLDGARNVQIHGVGHIGLLYSSQVNSLIKEGLNGGGQNTN 
$ - CCCECCCCCCCCCCCEEEEEEECCCCCCCHHHHCCECCEEEEECCCCCHHHHHCHHHHHHHHHHHCCCCECCC 
! - 2116013372895402000000420860714204041023150591406300518501410130044315077 
@ - illsIliisiSSsililllllliilSsIsliilililliilsIsSlilsillsISsIIilllilliiilsIss 


Figure 2-1. Example of a family from the protein sequence database.




The representation of different amino acids in the protein database is shown in
Table 2-1.

Table 2-1. Frequencies of the standard amino acids in the protein sequence database

Frequency  Amino  Average          Frequency  Amino  Average
(%)        acids  frequency        (%)        acids  frequency

9.33       Leu    6.78             6.69       Glu    4.15
5.81       Ser    6 fold codons    5.89       Asp    2 fold codons
5.19       Arg                     5.81       Lys
                                   4.28       Asn
8.70       Ala    6.63             3.98       Phe
7.23       Gly    4 fold codons    3.92       Gln
7.22       Val                     3.39       Tyr
5.39       Thr                     2.34       His
4.60       Pro                     1.08       Cys
5.67       Ile    5.67             2.08       Met    1.74
                  3 fold codon     1.40       Trp    1 fold codons
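
The "Average frequency" entries of Table 2-1 are the means of the individual frequencies within each codon degeneracy class, which can be checked with a short calculation. The grouping below is read from the table; the dictionary and function names are hypothetical.

```python
from statistics import mean

# Frequencies (%) from Table 2-1, grouped by the number of codons per amino acid.
FREQUENCY_BY_DEGENERACY = {
    6: {"Leu": 9.33, "Ser": 5.81, "Arg": 5.19},
    4: {"Ala": 8.70, "Gly": 7.23, "Val": 7.22, "Thr": 5.39, "Pro": 4.60},
    3: {"Ile": 5.67},
    2: {"Glu": 6.69, "Asp": 5.89, "Lys": 5.81, "Asn": 4.28, "Phe": 3.98,
        "Gln": 3.92, "Tyr": 3.39, "His": 2.34, "Cys": 1.08},
    1: {"Met": 2.08, "Trp": 1.40},
}


def class_averages(table):
    """Mean frequency per codon degeneracy class, rounded to two decimals."""
    return {fold: round(mean(freqs.values()), 2) for fold, freqs in table.items()}
```

The computed means are 6.78, 6.63, 5.67, 4.15, and 1.74 for the 6, 4, 3, 2, and 1 fold classes, in agreement with the table, and the twenty individual frequencies sum to 100%.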

Using this protein database, we identify parses that separate secondary structural
units, assign interior and surface residues from patterns of variation and conservation, and
predict segments of secondary structure.


CHAPTER  3 

PREDICTING  SEGMENTS  OF  SECONDARY  STRUCTURE 

Introduction 

Proteins  are  built  from  20  different  amino  acids  having  different  side  chains.  The 
differences  in  the  structure  of  the  side  chains  could,  in  principle,  influence  local 
conformation,  also  known  as  “secondary  structure.”  The  hypothesis  that  the  20  different 
amino  acids  might  therefore  have  different  propensities  to  adopt  different  secondary 
structures  (alpha  helices,  beta  strands,  and  coils)  was  originally  supported  by  the 
observation,  made  using  statistical  surveys  of  proteins  having  known  conformations,  that 
the  20  amino  acids  appeared  in  the  three  standard  secondary  structural  elements  (helix, 
strand,  or  coil)  with  distinct,  nonrandom  distributions  (Chou  and  Fasman  1974).  Here, 
propensity  was  defined  as  the  ratio  between  the  probability  of  an  amino  acid  appearing  in 
one  of  these  elements  divided  by  the  probability  of  the  average  amino  acid  appearing  in 
this  element.  For  example,  the  helical  propensity  for  Ala  was  found  by  Chou  and  Fasman 
to  be  1.45,  while  the  strand  propensity  for  Ala  was  found  to  be  0.97  (Table  3-5). 
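
The definition of propensity given above can be made concrete with a small sketch. It assumes that a data set is supplied as two parallel strings, one of residues and one of per-residue secondary structure states, and it is not a reimplementation of the Chou and Fasman survey; the function name and the toy data in the usage note are hypothetical.

```python
from collections import Counter


def propensities(residues, states, element):
    """Propensity of each amino acid for one secondary structural element.

    Propensity = P(residue of this type is in `element`)
                 / P(average residue is in `element`).
    residues, states: parallel strings of equal length.
    """
    total = len(residues)
    in_element = sum(1 for s in states if s == element)
    overall = in_element / total
    counts = Counter(residues)
    counts_in = Counter(r for r, s in zip(residues, states) if s == element)
    return {aa: (counts_in[aa] / counts[aa]) / overall for aa in counts}
```

In a toy data set with residues AAAAGGGG and states HHHCHCCC, three of four A residues and one of four G residues are helical, while half of all residues are helical, so the helical propensities are 1.5 for A and 0.5 for G.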

This  suggests  that  certain  amino  acids  may  be  energetically  more  favored  in  alpha 
helices  whereas  others  are  more  favored  in  beta  strands  or  coils.  A quantitative 
understanding  of  such  rankings  would  greatly  enhance  our  ability  to  predict  secondary 
structure from sequence and to rationally modify the stability and properties of proteins.

Weaknesses of Empirical Statistical Methods

Statistical methods, such as Chou-Fasman and GOR, are easily used, automated,
and widely employed by non-experts. Unfortunately, the median per-residue score
achieved by these methods is only about 60-65%, and it varies greatly depending on how
closely the protein resembles those from which the parameters were derived. There are
compelling  arguments  suggesting  that  these  methods  are  not  likely  to  yield  major  further 
improvements  in  the  tools  for  predicting  the  conformation  of  proteins. 

The  most  prominent  weakness  is  their  underlying  strategy  of  assuming  that  local 
conformation  (secondary  structure)  is  predominantly  determined  by  local  sequence.  The 
methods  assign  secondary  structure  to  a polypeptide  segment  by  examining  a sliding 
window and ignoring the influence of the rest of the protein on secondary structure.
Unfortunately, considerable evidence shows that distant interactions in a protein that are not
local  in  the  sequence  exert  a greater  influence  over  local  conformation  than  local 
sequence.  For  example,  identical  pentapeptide  sequences  were  known  in  two  different 
proteins  that,  in  one,  formed  a helix  and  in  the  other  formed  a strand.  From  this,  it  was 
suggested  that  the  likelihood  of  a pentapeptide  in  two  tertiary  structural  contexts  forming 
the  same  secondary  structure  was  not  substantially  larger  than  the  likelihood  that  two 
random  pentapeptides  would  (Kabsch  and  Sander  1984). 
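
The locality of a sliding-window method is easy to see in a sketch. The code below is not the Chou-Fasman or GOR procedure; it simply averages a per-residue propensity over a short window, which is enough to show that the score at a position depends only on the residues inside the window. The function name, the window length, and the default propensity of 1.0 for residues missing from the table are assumptions.

```python
def window_average(sequence, propensity, window=5):
    """Average propensity over a window centered on each residue.

    propensity: dict mapping one-letter codes to propensity values;
    residues not in the dict are given a neutral value of 1.0.
    Windows are truncated at the ends of the sequence.
    """
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        stop = min(len(sequence), i + half + 1)
        values = [propensity.get(aa, 1.0) for aa in sequence[start:stop]]
        scores.append(sum(values) / len(values))
    return scores
```

Because two identical pentapeptides always receive identical scores from such a function, whatever protein they sit in, a method of this kind cannot by itself distinguish the helical copy of a pentapeptide from the strand copy.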

More  recently,  hexapeptides  and  heptapeptides  having  identical  sequence  in  two 
proteins  have  been  shown  to  have  different  secondary  structures.  Together,  these  data 
imply  that  tertiary  structural  interactions  are  more  important  than  local  sequence,  and 
perhaps  decisive,  in  determining  secondary  structure.  For  protein  prediction,  this 
suggested  that  aspects  of  tertiary  structure  must  be  predicted  before  secondary  structure 
can  be  predicted. 

Second, probabilistic methods may perform better on proteins that adopt a class of
fold that is well represented in the database upon which the method is parameterized, and
poorly on classes of fold that are poorly represented in the same database (Benner et al.
1997). Inspection of the statistical parameters themselves shows evidence of this bias.

For  example,  the  GOR  parameter  for  a coil  structure  correlates  both  with  the 
hydrophobicity  index  (Fauchere  et  al.  1988)  and  with  observed  side  chain  accessibility  of 
the  individual  amino  acids.  This  correlation  presumably  reflects  the  fact  that  both  coils 
and hydrophilic amino acids are found preferentially on the surface of proteins within the
set of proteins used to parameterize the GOR method (Rose 1978).

Similarly,  the  strongest  predictor  of  the  GOR  strand  propensity  is  hydrophobicity 
and  interior  position.  That  might  be  expected  given  that  strands  lie  preferentially  inside 
the  globular  structures  found  in  the  original  databases  used  to  parameterize  the  GOR 
method in the 1970s. Only the helix parameter lacks a correlation with hydrophobicity.
This  might  be  interpreted  as  reflecting  the  fact  that  in  the  historical  crystallographic 
database,  a majority  of  the  helices  lie  on  the  surface  of  globular  folds,  with  part  of  their 
residue  side  chains  pointing  out  to  solvent  and  part  pointing  in  toward  the  hydrophobic 
core.  These  correlations  suggest  at  least  the  possibility  that  the  observed  propensities 
reflect  in  part  tertiary  structural  influences  on  secondary  structure  rather  that  intrinsic 
propensities  of  specific  side  chains. 

Experimental  Host-Guest  Approach 

Host-guest  experiments  offer  a tool  to  validate,  in  the  laboratory,  inferences  drawn 
from  an  analysis  of  propensities  in  natural  proteins.  This  approach  measures  the  effect  of 
individually  introducing  each  amino  acid  as  a guest  into  a standard  host  peptide  sequence 
that  forms  (for  example)  a reference  helix,  determining  the  stability  of  the  helix  with  the 
guest,  and  then  concluding  that  different  stabilities  with  different  guests  reflect  intrinsic 
differences  in  the  ability  of  the  guest  amino  acid  to  stabilize  a helix. 




As  a result  of  such  studies,  a quantitative  ranking  of  the  propensity  of  each  amino 
acid  to  adopt  an  alpha  helical  or  beta  sheet  conformation  is  obtained  (Chakrabartty  and 
Baldwin  1995,  Smith  and  Regan  1997).  Of  key  importance  are  the  selection  of  the  host 
peptide  sequence  and  the  local  environment  of  the  guest  site.  If  such  measurements  are 
to  be  meaningful,  the  aim,  in  so  far  as  it  is  experimentally  feasible,  is  to  isolate  the  guest 
residue  from  interactions  with  neighboring  residues.  If  true  isolation  could  be  achieved, 
then  the  experiment  would  measure  only  the  free  energy  associated  with  the  unfolded-to- 
helix  transition  as  it  is  influenced  by  the  guest. 

To date, several different hosts have been employed, including block copolymers
(Wojcik et al. 1990), short, designed peptides (Kallenbach et al. 1996, O'Neil and
DeGrado 1990, Park et al. 1993, Rohl and Chakrabartty 1996) and small natural proteins
(Blaber et al. 1993, Horovitz et al. 1992). The different host systems have generated
similar, but not identical, thermodynamic scales for the α-helix forming tendencies of
different amino acids. The range of free energies between the best and worst alpha helix
forming residues differs quite significantly between the different studies, however. This
is presumably due to a failure of one of the approximations made in the model, including
the failure to achieve complete isolation.

Surveying the inconsistencies in these experimental data, we conclude either that
the propensities of different amino acids to adopt different secondary structural types are
not large, or that the individuality of the amino acids is not well captured in the
host-guest experiment.

A middle point of view, arguing that both tertiary contacts and local sequence
contribute to the formation of local conformation, is likely to be the most plausible. In
this view, the only question concerns the relative contributions of distant and local
factors. To understand the relative contribution, we began by recognizing that the
database  contains  different  types  of  helices,  most  notably  those  inside  the  fold,  unexposed 
to  solvent  water,  and  those  outside.  The  inside  of  a protein  looks  more  like  hexane  than  it 
does  water.  There  is  no  reason  from  first  chemical  principles  to  expect  that  the 
conformation  of  anything  when  dissolved  in  hydrocarbon  will  behave  the  same  as  when 
dissolved  in  water.  This  is  especially  true  with  polypeptides,  where  the  dominant 
structural  feature  is  a repeating  dipole  and  hydrogen  bonding. 

Further,  surface  helices  are  in  an  environment  that  is  strongly  anisotropic  with 
respect  to  solvation  potential.  We  might  expect,  therefore,  that  helices  that  lie  on  the 
surface  will  exploit  this  anisotropy  to  achieve  a stable  local  conformation,  placing 
hydrophobic  residues  in,  and  hydrophilic  residues  out.  Helices  that  lie  entirely  within  the 
fold  of  a protein  cannot  exploit  this  anisotropy.  This  observation  suggested  that  interior 
helices,  because  they  cannot  use  differential  solvation  to  stabilize  themselves,  will  have 
residues  that  reflect  better  their  intrinsic  helix-forming  propensities.  This  signal  of 
intrinsic  propensity  would  have  been  lost  by  aggregating  data  from  helices  of  different 
types,  in  particular,  internal  and  external  helices,  and  core  and  non-core  helices. 

Methods  and  Materials 

A program was written to extract all of the helices and strands from the database.
These secondary structure segments were grouped based on their DSSP-defined interior
and surface exposures. This determination was based on the surface accessibility
parameter. The surface accessibility parameter is a single digit equal to 10 times the
fractional surface accessibility measured for the side chain, with the fraction rounded
down to the nearest tenth. In the case that the fraction of side chain exposure is greater
than 1.0, the number 9 was recorded. For each of these interior and surface exposure
groupings, the number of core and non-core secondary structure segments was determined.
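The single-digit encoding described above can be sketched as follows. This is only an illustration: the function name `accessibility_digit` is hypothetical, and the treatment of a fraction of exactly 1.0 (recorded as 9 so that the result stays a single digit) is an assumption, since the text only specifies the case of fractions greater than 1.0.

```python
import math


def accessibility_digit(fraction: float) -> int:
    """Encode a fractional side-chain accessibility as a single digit (0-9)."""
    if fraction < 0:
        raise ValueError("accessibility fraction cannot be negative")
    # Ten times the fraction, rounded down. Any value that would need two
    # digits (fraction >= 1.0, an assumption for exactly 1.0) is recorded as 9.
    return min(9, math.floor(fraction * 10))
```

A side chain with a fractional accessibility of 0.37 would be recorded as 3 under this sketch, and one with 1.2 as 9.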

The conformational preference $CP(j,k)$ of an amino acid of type $j$ for a secondary
structure $k$ is defined as the ratio of the probability, $P_{j,k}$, of finding the $j$ residue in
secondary structure $k$ to the probability, $P_j$, of finding the $j$ amino acid anywhere in the
protein sequence (Chou and Fasman 1974):

$$CP(j,k) = \frac{P_{j,k}}{P_j} \tag{3-1}$$

in which

$$P_{j,k} = \frac{n_{j,k}}{N_k} \tag{3-2}$$

where $n_{j,k}$ is the number of residues of type $j$ in secondary structure $k$ and $N_k$ is the
total number of residues in secondary structure $k$, and

$$P_j = \frac{n_j}{N} \tag{3-3}$$

where $n_j$ is the number of residues of type $j$ in all of the sequences, and $N$ is the total
number of residues. It is worth mentioning that $CP(j,k)$ is not a probability; it measures
the bias of finding the amino acid type $j$ in state $k$, compared with the average occurrence
of any type of amino acid in state $k$. As such, $CP(j,k)$ will take values > 1 for residues
that favor conformation $k$, and < 1 otherwise. The $k$ conformational state in proteins is
the α helix, β strand, or random coil. In addition to the propensity, the average length,
average surface accessibility and the number of core and non-core segments were
calculated for each group.
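As an illustration of the conformational preference defined above, the following minimal sketch counts residues directly from paired sequence and secondary-structure strings. The function name `conformational_preference`, the one-letter state codes, and the input layout (one structure string per sequence, equal in length) are assumptions made for the example and are not taken from the program used in this work.

```python
from collections import Counter


def conformational_preference(sequences, structures, state):
    """Return {residue: CP(j, state)} from paired sequence/structure strings."""
    n_j = Counter()   # occurrences of each residue type anywhere
    n_jk = Counter()  # occurrences of each residue type in the chosen state
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have equal length")
        for residue, s in zip(seq, ss):
            n_j[residue] += 1
            if s == state:
                n_jk[residue] += 1
    total = sum(n_j.values())            # N, all residues
    total_in_state = sum(n_jk.values())  # N_k, residues in the chosen state
    if total_in_state == 0:
        raise ValueError("no residues found in the requested state")
    # CP(j,k) = (n_jk / N_k) / (n_j / N)
    return {
        res: (n_jk[res] / total_in_state) / (n_j[res] / total)
        for res in n_j
    }
```

For the toy input `conformational_preference(["AAGG"], ["HHHC"], "H")`, alanine (two of the three helical residues, half of all residues) receives a preference above 1 and glycine a preference below 1.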




Results 

Table  3-1  collects  the  length  parameter  for  the  helices  in  the  database.  This 
provides  an  overview  of  the  length  distribution  of  the  helices.  Columns  3 and  4 indicate 
that  there  are  a greater  number  of  internal  residues  than  surface  residues  within  the 
helices  represented  in  this  database.  The  plot  in  Figure  3-1  shows,  however,  that  the 
distribution  is  quite  broad,  with  no  particular  length  being  dominant. 


[Figure 3-1: plot of number of helices (y-axis, 0 to 400) versus helix length (x-axis, 5 to 18 residues).]

Figure  3-1.  Plot  of  number  of  helices  versus  helix  length  in  the  reference  protein 
database. 

Table  3-2  shows  the  propensities  of  the  amino  acids  within  helices  where  all  of  the 
side  chains  are  internal.  These  helices  are  defined  as  having  only  internal  residues 
occupying  all  of  the  sites  in  the  segment.  In  this  database  there  are  1400  of  these  helices. 
The “relative” column is calculated by dividing the absolute value, the number of
observed occurrences, of a particular amino acid by the sum of the absolute values for all
of the amino acids. This value is then divided by the frequency of the respective amino
acid in order to determine the propensity of the given amino acid. This table indicates
that amino acids that are most likely to be found in this type of helix are methionine,
leucine and alanine.
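As a worked example of this calculation, consider methionine in Table 3-2. The total of 11,155 used below is not stated in the text; it is obtained here by summing the “Abs” column of Table 3-2, so it should be read as a derived figure rather than a reported one.

$$\text{Rel}_{M} = \frac{323}{11155} \times 100 \approx 2.90, \qquad \text{Prop}_{M} = \frac{\text{Rel}_{M}}{\text{Freq}_{M}} = \frac{2.90}{2.08} \approx 1.39$$

Both values agree with the methionine row of Table 3-2.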


Table 3-1. Tabulation of the number of helices, the number of internal and surface
residues and the percentage of helices that follow a helical wheel

Helix    Number of   Internal residues   Surface residues   % of helices that
length   helices     per helix           per helix          follow a helical wheel
5        288          3.77               1.23               91
6        358          4.27               1.73               84
7        346          5.19               1.81               80
8        319          6.23               1.77               77
9        335          7.10               1.90               73
10       336          7.73               2.27               66
11       364          8.83               2.17               65
12       284         10.04               1.96               62
13       300         10.61               2.39               48
14       278         11.42               2.58               45
15       177         12.38               2.62               42
16       151         13.05               2.95               38
17       179         14.05               2.95               37
18       100         14.80               3.20               41


Table 3-3 shows the propensities of the amino acids that lie within helices that do
not have all of the side chains internal. In this database there are 4497 of these helices.
This table indicates that amino acids that are most likely to be found in this type of helix
are glutamic acid, glutamine, and alanine. Comparing this with helices having all of their
residues internal, one feature is evident: the hydrophilic residue propensities increase.
This is most notable in the cases of glutamic acid and lysine.

Comparison with the Chou and Fasman (1974) propensities is also interesting.
Leu, Ala and Glu are the three strongest helix formers according to Chou and Fasman.
Only Val, Cys and Thr are Chou-Fasman strand-formers. Gly is commonly found in
turns and other loops where it allows specific conformations to be attained (owing to the
absence of a side-chain). Table 3-4 shows the Chou-Fasman propensities.


Table 3-2. Propensity of the amino acids within helices where all side chains are internal
(1400)

Residue   Prop    Rel      Freq    Abs
M         1.39     2.90    2.08     323
L         1.32    12.34    9.33    1376
A         1.29    11.21    8.70    1250
W         1.27     1.78    1.40     199
C         1.16     1.25    1.08     139
F         1.12     4.47    3.98     499
Y         1.12     3.81    3.39     425
H         1.09     2.55    2.34     285
I         1.05     5.97    5.67     666
S         0.99     5.77    5.81     644
R         0.99     5.14    5.19     573
Q         0.92     3.58    3.92     399
V         0.91     6.61    7.22     737
E         0.88     5.87    6.69     655
G         0.87     6.26    7.23     698
T         0.86     4.66    5.39     520
P         0.80     3.66    4.60     408
D         0.78     4.57    5.89     510
N         0.77     3.28    4.28     366
K         0.75     4.33    5.81     483


Data given for internal helices. For each amino acid, “Prop” is the amino acid propensity defined
as the probability that a given residue lies in an internal helix divided by the probability that any
residue lies in a given secondary structure. “Rel” is the reliability of the assignment, the
probability (in %) that the indicated amino acid will be found in the indicated secondary structure.
“Freq” represents the percent amino acid frequency of the protein database. “Abs” indicates the
number of times the indicated residue is found in an internal helix in the database examined.

Discussion 

The  ability  of  certain  amino  acids  to  form  hydrogen  bonds  is  important  for  the 
formation  of  secondary  structures.  There  is  extensive  hydrogen  bonding  in  the  alpha 
helix.  Alanine,  glutamine,  and  leucine  residues  reportedly  favor  alpha  helices  (Stryer 
1995).  Serine,  aspartate,  and  asparagine  disrupt  alpha  helices  because  their  side  chains 
contain  a hydrogen-bond  donor  and  acceptor  in  close  proximity  to  the  backbone,  where 
they compete for main-chain NH and CO groups (Stryer 1995).




Table 3-3. Propensity of the amino acids within helices where not all side chains are
internal (4497)

Residue   Prop    Rel      Freq    Abs
E         1.33     8.88    6.69    4066
Q         1.29     5.06    3.92    2318
A         1.27    11.06    8.70    5067
K         1.20     6.97    5.81    3192
L         1.20    11.15    9.33    5105
R         1.18     6.12    5.19    2801
M         1.11     2.30    2.08    1051
I         0.99     5.63    5.67    2577
W         0.98     1.37    1.40     628
Y         0.92     3.12    3.39    1428
F         0.91     3.64    3.98    1669
D         0.90     5.31    5.89    2430
N         0.89     3.79    4.28    1735
V         0.87     6.25    7.22    2861
H         0.87     2.03    2.34     931
S         0.86     4.99    5.81    2285
C         0.85     0.92    1.08     423
T         0.81     4.38    5.39    2006
G         0.60     4.31    7.23    1975
P         0.59     2.72    4.60    1246


Data given for helices in which not all side chains are internal. For each amino acid, “Prop” is
the amino acid propensity defined as the probability that a given residue lies in such a helix
divided by the probability that any residue lies in a given secondary structure. “Rel” is the
reliability of the assignment, the probability (in %) that the indicated amino acid will be found in
the indicated secondary structure. “Freq” represents the percent amino acid frequency of the
protein database. “Abs” indicates the number of times the indicated residue is found in such a
helix in the database examined.

Valine and isoleucine also disfavor the formation of the alpha-helix, but for a
different reason: the branching at the beta carbon of their side chains sterically hinders
alpha-helix formation. This is immediately evident from the data. Consider the increase
in propensity when comparing helices where all of the side chains are internal with
those where they are not. We notice, as expected, that the propensity of the hydrophobic
amino acids rises. However, valine and isoleucine show very little increase in propensity.
This is illustrated in Table 3-5. Alanine shows no increase in propensity; however, it is
important to note that the propensity of alanine in helices is high to begin with.


Proline disfavors alpha helix formation because of steric hindrance and because it
lacks an amide H atom for hydrogen bonding. The hydrogen side group in glycine allows too
much flexibility around the alpha carbon, and thus glycine is also not frequently found in
alpha helices (Lodish et al. 1995).


Table 3-4. Chou-Fasman (1974) assignment of amino acid propensities

Helical               Beta sheet
residues   Palpha     residues     Pbeta
E          1.53       M            1.67
A          1.45       V            1.65
L          1.34       I            1.60
H          1.24       C            1.30
M          1.20       Y            1.29
Q          1.17       F            1.28
W          1.14       Q            1.23
V          1.14       L            1.22
F          1.12       T            1.20
K          1.07       W            1.19
I          1.00       A            0.97
D          0.98       R            0.90
T          0.82       G            0.81
S          0.79       D            0.80
R          0.79       K            0.74
C          0.77       S            0.72
N          0.73       H            0.71
Y          0.61       N            0.65
P          0.59       P            0.62
G          0.53       E            0.26


Table 3-5. Change in propensity from internal to non-internal helices

Residue   Prop not internal   Prop internal   Δ Prop
M         1.19                1.39            0.20
Y         0.97                1.12            0.15
F         0.99                1.12            0.13
W         1.16                1.27            0.11
L         1.26                1.32            0.06
I         1.01                1.05            0.04
V         0.89                0.91            0.02
A         1.29                1.29            0


Beta sheets are composed of beta strands. Each beta strand is about 5-8 residues
long and the backbone atoms of each residue are able to hydrogen bond (Lodish et al.
1995). Through association via hydrogen bonding, individual strands are able to form
beta sheets, with the side chains of residues protruding perpendicular to the sheets.




Turns  are  compact,  U-shaped  structures  composed  of  three  or  four  residues 
stabilized  by  a hydrogen  bond  between  their  end  residues.  The  turns  are  located  on 
protein  surfaces  and  form  a sharp  bend  that  redirects  the  polypeptide  backbone  back 
toward  the  hydrophobic  interior.  Glycine  and  proline  tend  to  favor  the  formation  of 
turns. 

Prediction  of  Internal  Helices 

It  has  been  a decade  since  the  first  convincing  tools  emerged  to  predict  secondary 
structure  from  a set  of  aligned  homologous  sequence  data.  These  methods  differ  from 
homology  modeling  or  profile-based  tools  to  predict  structure.  The  latter  require  that  a 
family  of  proteins  include  at  least  one  member  whose  crystal  structure  has  been 
experimentally  obtained.  In  contrast,  evolution-based  methods  examine  patterns  of 
variation  and  conservation  of  amino  acids  within  a set  of  proteins  diverging  under 
functional constraints, extracting information about the fold of the protein family from
the difference between the pattern of replacement actually observed and the patterns
predicted by simple stochastic models for sequence divergence that treat proteins as
formless, functionless strings of letters.

Considerable success has been achieved by evolution-based methods in identifying
long helices that lie on the surface of the protein. These frequently reveal their presence
by a 3.6-residue pattern of internal and surface sites. While this has been tested many
times in bona fide prediction environments, there has never been a retrodictive test of this
against a modern database. This was the first goal of our effort.

Further, evolution-based tools have failed, sometimes notably, to identify short
helices and internal helices. Helices lacking at least one turn are too short for a 3.6-site
pattern to be statistically significant above random segments holding internal and
surface assignments. Further, short helices and coils are not very distinct. This is
especially true for helices that are 4-7 residues in length, where different
crystallographers looking at the same coordinate data may disagree as to whether a helix
is present or not.

For  internal  helices,  the  difficulties  are  more  fundamental.  An  internal  helix  is 
characterized  as  a string  of  internal  positions.  No  3.6  site  periodicity  of  the  type  expected 
for  a surface  helix  is  possible  in  this  case,  although  3.6  site  periodicity  of  other  types  has 
been  found  by  Rees  and  others  for  membrane  helices  (Stowell  and  Rees  1995). 

The  issue  has  become  intriguing  because  Rost  (Rost  and  Sander  1994)  has 
suggested  that  the  neural  network  implementation  of  the  Benner-Gerloff  method  (the 
PHD  program)  can  detect  internal  helices.  The  figure  that  accompanies  this  suggestion 
(Fig.  3-2)  indicates  the  contrary,  with  internal  helices  being  identified  only  ca.  40%  of  the 
time,  with  the  best  prediction  (as  expected)  being  for  helices  with  20-50%  surface 
exposed.  Further,  it  was  not  clear  how  the  per-site  prediction  accuracy  could  be  constant 
over  a range  where  the  per  segment  prediction  accuracy  varied.  Last,  it  is  not  clear  in  this 
unrefereed  disclosure  whether  the  test  set  included  structures  that  were  used  to 
parameterize the PHD tool. Indeed, the conclusion presented by Rost, that the best
predictions are obtained for helical segments having intermediate values of accessibility,
is exactly what one would expect from an understanding of the method for prediction
based on this tactic.




[Figure 3-2: PHD on 705 sequence-unique proteins. Legend: correctly predicted helices; helices not predicted; residues predicted and observed in H; residues predicted in H, observed in E; residues predicted in L, observed in H. x-axis: relative solvent accessibility (average of observed helices), 0 to 100.]


Figure  3-2.  Accuracy  of  predicting  residues  in  helices  (or  entire  helical  segments)  is 
plotted  vs.  the  observed  relative  solvent  accessibility  (according  to  DSSP 
assignments).  Values  for  per-residue  accuracy  are  averaged  over  all  residues 
with  a given  relative  solvent  accessibility.  Values  for  per-segment  accuracy 
are  averaged  over  all  helical  segments  with  a given  average  accessibility. 

We then asked whether the helices themselves could be detected from patterns in
the experimentally assigned surface/interior accessibilities using the binary pattern of
assignments introduced by Benner and Gerloff (1991). First, we made a list of all
possible surface/interior patterns, of which there are $2^n$, where $n$ is the length of the
pattern. We then divided them into two groups, those that were consistent with a helical
structure, and those that were not. The final column of Table 3-1 provides the percentage
of helices, of a given length, that follow this type of helical pattern.
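The enumeration of surface/interior patterns can be sketched as follows. The text does not state the criterion used to decide whether a pattern is consistent with a helical structure, so the rule below is an assumption chosen purely for illustration: residues are placed on a helical wheel at 100 degrees per residue (3.6 residues per turn), and a pattern is accepted when all interior positions fall within an arc of at most 180 degrees that contains no surface position. The names `all_patterns` and `fits_helical_wheel` are hypothetical.

```python
from itertools import product


def all_patterns(n):
    """All 2**n strings over 'i' (interior) and 's' (surface)."""
    return ["".join(p) for p in product("is", repeat=n)]


def fits_helical_wheel(pattern, max_arc=180, step=100):
    """Assumed test: interior sites occupy one face of the helical wheel."""
    angles = [(k * step) % 360 for k in range(len(pattern))]
    interior = [a for a, c in zip(angles, pattern) if c == "i"]
    surface = [a for a, c in zip(angles, pattern) if c == "s"]
    if not interior or not surface:
        # Uniform patterns have no contrast between faces to detect.
        return False
    for start in interior:
        # Smallest arc beginning at `start` that covers every interior site.
        span = max((a - start) % 360 for a in interior)
        if span <= max_arc and all((a - start) % 360 > span for a in surface):
            return True
    return False
```

Under this assumed rule, the fraction of helix-consistent patterns of length n would be `sum(map(fits_helical_wheel, all_patterns(n))) / 2**n`. A uniformly internal pattern is rejected, which mirrors the difficulty with internal helices discussed in this section.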

All of the secondary structure segments were queried to identify which of the
internal helices and internal strands were interrupted in any of the proteins by a gap in the
multiple sequence alignment (MSA). For the interruption to be accepted, the gapped
sequence must contain sequence segments before and after the gap that aligned with the
remainder of the protein sequence. This required reference to our resource of multiple
sequence alignments. As a result, we were able to establish two sets of helices and
strands, respectively, those that are not gapped, and those that are. The first will be called
“core,” the second will be called “non-core.” There were 984 core helices and 277
non-core helices.
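The core/non-core assignment described above can be sketched as follows. This is a simplified illustration, not the program used in this work: the function names are hypothetical, the alignment is assumed to be a list of equal-length strings with `-` as the gap character, and the segment is given as a half-open range of alignment columns.

```python
def is_interrupted(row, start, end):
    """True if `row` has a gap inside columns [start, end) that is flanked
    by aligned residues both before and after the gap."""
    for col in range(start, end):
        if row[col] == "-":
            has_before = any(c != "-" for c in row[:col])
            has_after = any(c != "-" for c in row[col + 1:])
            if has_before and has_after:
                return True
    return False


def classify_segment(msa, start, end):
    """Label a segment 'non-core' if any aligned sequence is interrupted
    by an accepted gap within it, otherwise 'core'."""
    if any(is_interrupted(row, start, end) for row in msa):
        return "non-core"
    return "core"
```

Requiring residues on both sides of the gap excludes sequences that are simply truncated at one end, which corresponds to the acceptance condition stated in the text.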

We then asked whether the average accessible surface area (ASA) for non-core
helices is higher than the average ASA for core helices. It is not, although the
difference was found not to be significant. The average ASA for non-core helices was
0.67, while the average ASA for core helices was 0.70.

We also tested a second hypothesis relating a feature of the helix to its
core/non-core assignment based purely on gapping: that the average non-core helix is
shorter than the average core helix. This hypothesis was not confirmed by the data. The
average length of non-core helices was 9.2 residues, while the average length of core
helices was 7.5 residues. These results are summarized in Table 3-6.

We  have  examined  here  in  a retrodictive  sense  a set  of  proteins  that  have  known 
crystal  structure,  to  place  limits  on  the  length  and  exposure  of  helices  that  are  likely  to 
reveal  themselves  using  patterns. 




Table 3-6. Number of core and non-core internal helices as well as their respective
accessible surface areas and lengths

Internal               Average accessible   Average
helices     Number     surface area         length
Core        984        0.70                 7.5
Non-core    277        0.67                 9.2


Different  classes  of  errors  have  different  impact  on  the  outcome  of  a prediction 
exercise.  Also,  helices  play  different  roles  in  the  fold.  If  one  mistakenly  assigns  a core 
helix  as  a strand,  or  a core  strand  as  a helix,  the  resulting  prediction  is  unlikely  to  be 
useful. 


CHAPTER  4 

IDENTIFYING  PARSES  THAT  SEPARATE  SECONDARY  STRUCTURAL  UNITS: 
IMPLICATIONS  FOR  PREDICTING  PROTEIN  SECONDARY  STRUCTURE 

Introduction 

Much effort has been devoted to developing approaches for assigning α helices, β
strands,  and  other  standard  secondary  structural  elements  to  a protein  starting  from 
sequence  data  alone  (Guzzo  1965).  Less  attention  has  been  directed  toward  a 
complementary  problem,  identifying  in  a protein  sequence  elements  that  indicate  a break 
in  standard  secondary  structure.  A segment  of  the  polypeptide  that  lies  between  helices 
and  strands,  or  at  a point  where  two  standard  secondary  structural  elements  abut,  is  called 
a parsing  element.  These  punctuate  a sequence  that  might  otherwise  be  hundreds  of 
amino  acids  in  length  into  units  that  are  manageable  and  contain  single  elements  of 
secondary-structure information.

The  value  of  tools  that  parse  sequences  was  made  evident  by  the  bona  fide 
predictions  of  secondary  structure  to  tryptophan  synthase  (Crawford  et  al.  1987)  and 
protein  kinase  (Benner  and  Gerloff  1991),  both  announced  before  crystallographic  data 
were available. These two predictions were shown by subsequently determined crystal
structures to have been remarkably accurate. Crawford and colleagues (1987) predicted
α helices, β strands and turns based on two standard prediction algorithms. This study
found good agreement between predicted loops based on chain flexibility values. Benner
and Gerloff (1991) predicted parsing regions based on gapping within multiple sequence
alignments, as well as more sophisticated assignments of interior and surface residues
from patterns of variation and conservation of homologous protein sequences.

Assigning  parsing  segments  is  possible,  in  principle,  using  empirical  methods  that 
predict coils. Such tools are well known from early work by Chou and Fasman (1974) and
Garnier et al. (1978). Tools that define the ends of helices and strands could also, in
principle,  be  used  to  parse  a sequence.  Other  studies  have  attempted  to  predict  parses 
based  on  the  notion  that  a globular  protein  of  a specific  size  must  have  a certain  number 
of  turns  in  order  to  fold  as  a globule  (Rose  and  Wetlaufer  1977,  Rose  1978). 

Cohen  and  coworkers  (1986)  suggested  that  four  consecutive  hydrophilic  amino 
acids  might  indicate  breaks  in  secondary  structure.  This  suggestion  is  based  on  the  fact 
that  turns  and  coils  generally  lie  on  the  surface  of  a protein,  as  do  hydrophilic  amino 
acids.  Some  of  the  useful  tools  developed  to  identify  helix  caps  might,  in  principle,  be 
used  as  parsing  tools.  Some  of  the  more  useful  tools  for  deducing  the  conformation  of  a 
protein  (secondary  and  tertiary  structure)  originate  from  sequence  information  (primary 
structure).  Progress  toward  solving  this  problem  has  come  through  an  analysis  of 
patterns  of  conservation  and  variation  in  the  sequences  of  homologous  proteins. 
Predictions  made  using  this  approach  are  consensus  models  for  the  conformation  of  a 
protein family and assume that proteins related by common ancestry have similar
conformations  (Chothia  and  Lesk  1986). 

Parsing  tools  divide  a protein  sequence  into  segments  that  form  standard  secondary 
structure  independently.  By  parsing  a sequence,  secondary  structure  predictions  need  to 
consider  only  short  segments  of  the  polypeptide  chain,  which  is  intrinsically  easier  than 
considering the polypeptide chain as a whole. Therefore, understanding the evolution of
loops is an important step toward developing tools for predicting secondary structure in
proteins.

The  same  polypeptide  segments  may  adopt  different  secondary  structures  when 
embedded  in  different  tertiary  structural  contexts.  This  implies  that  local  sequences  need 
not  determine  all  secondary  structures.  Fortunately,  this  does  not  appear  to  be  the  case 
for  many  sequences  involved  in  coils.  Strings  (consecutive  positions  in  a polypeptide 
chain)  of  Pro,  Gly,  Asp,  Asn,  or  Ser  prove  to  be  good  indicators  of  a parse  in  standard 
secondary structural elements (Chou and Fasman 1978). In general, a longer parsing
string is more reliable than a shorter one at predicting a parse, and a string
containing more prolines is better than one containing fewer prolines. Thus, a single Pro
in a sequence is not a reliable indicator of a parse. Benner and colleagues (1995)
examined a pre-genomic database and found that a Pro-Gly sequence nearly always
indicates a parse, as does a Gly-Ser-Asn-Ser sequence.
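A search for candidate parsing strings of this kind can be sketched as follows. The heuristics in this work were written in Perl; the Python below is only an illustration, and the function name `find_parsing_strings` and the default minimum run length of two residues are assumptions. The residue set (Pro, Gly, Asp, Asn, Ser) is the one named in the text.

```python
import re

# One-letter codes for Pro, Gly, Asp, Asn and Ser.
PARSING_RESIDUES = "PGDNS"


def find_parsing_strings(sequence, min_length=2):
    """Return (start, end, string) for each run of consecutive parsing
    residues that is at least `min_length` long (end is exclusive)."""
    pattern = re.compile("[%s]{%d,}" % (PARSING_RESIDUES, min_length))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]
```

Because longer strings are described as more reliable, the returned runs could be ranked by length, or by proline count, before parses are assigned. A single isolated Pro is ignored with the default setting.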

In  this  chapter,  we  present  tools  that  can  be  used  to  “parse”  a multiple  alignment 
into  segments  that  represent  secondary  structural  elements.  These  are  applied  to  a post- 
genomic  dataset.  The  methods  are  evaluated  by  examining  their  performance  in  protein 
families  where  exactly  one  member  has  a crystal  structure. 

Methods  and  Materials 

The heuristics presented herein were programmed in Perl. The MasterCatalog
database was manipulated by the MySQL database system. The full-length sequences
were aligned using the Darwin system. The surface and interior predictions were based
on the DSSP algorithm. Darwin features a number of built-in procedures for sequence
analysis, such as an efficient sequence retrieval mechanism and dynamic programming.
Because the interaction with the Darwin system is made through an interpreter, it allows
for the rapid prototyping and testing of the application. Finally, Darwin also manages
memory allocation, which is known to be the major source of problems when developing
an application.

For  this  analysis,  a list  of  proteins  with  one  crystal  structure,  together  with 
secondary  structure  assignments  by  DSSP  (Kabsch  and  Sander  1983),  was  obtained  from 
the  MasterCatalog.  The  MasterCatalog  has  pre-computed  families  that  provided  a 
convenient way to locate homologs. This yielded a database containing 1027 sequences
of  proteins  where  one  crystal  structure  was  available.  This  database  was  used  to  analyze 
single  sequences. 

These  sequences  were  then  used  as  probes  to  search  the  GenBank  129  sequence 
database.  Pairwise  alignments  were  constructed  using  DARWIN  and  the  Smith- 
Waterman  algorithm.  Multiple  alignments  were  constructed  using  the  routine  on 
DARWIN.  In  the  search,  pairwise  alignments  between  the  probe  sequence  and  the 
recovered  sequence  were  made  if  the  recovered  sequence  was  more  than  or  equal  to  1 
PAM  unit  from  the  probe  sequence  and  less  than  or  equal  to  125  PAM  units  distant  from 
the  probe  sequence.  After  culling,  666  families  of  proteins  were  multiply  aligned. 

Coil in this discussion indicates a residue that is assigned neither as an α helix nor as a β strand in a protein structure. Thus, it includes regions of random coil, turns, or other non-standard secondary structure. To calculate the coverage, the secondary structure is first broken into segments (the regions between secondary structural elements). A segment is said to be parsed if at least one of its positions was assigned a parse by the heuristic. The measure of coverage is as follows:

coverage = (number of parsed segments / number of segments) × 100.
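The coverage measure above can be sketched in a few lines of Python. This is only an illustration, not the Perl code used in this work: the function name, the representation of segments as inclusive index ranges, and the sample data are all assumptions made for the example.

```python
def coverage(segments, parsed_positions):
    """Percentage of segments that contain at least one predicted parse.

    segments: list of (start, end) inclusive residue index ranges
              (hypothetical representation of the regions between
              secondary structural elements)
    parsed_positions: residue indices flagged as parses by a heuristic
    """
    if not segments:
        raise ValueError("at least one segment is required")
    parsed = set(parsed_positions)
    # A segment counts as parsed if any of its positions was flagged.
    hit = sum(
        1
        for start, end in segments
        if any(pos in parsed for pos in range(start, end + 1))
    )
    return 100.0 * hit / len(segments)


# Hypothetical example: four segments, two of which contain a flagged site.
example_segments = [(0, 4), (12, 15), (30, 33), (40, 47)]
print(coverage(example_segments, [3, 31, 32]))
```

Note that a segment with several flagged positions still counts only once, so the measure rewards finding each break rather than flagging many sites within one break.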




Results 


Single  Residue  Parses 

The notion that certain amino acids show preferences for a particular conformation (alpha helix, beta sheet, or coil) is central to the classical work of Chou and Fasman (1978). We look here at the propensities of each of the 20 amino acids to indicate a parse in the 666 experimental structures. The results are shown in Table 4-1. They show that the probability that Pro and Gly will be found in a parse is indeed higher than average. This is, of course, nothing more than a Chou-Fasman analysis based on a larger database, and provides no new or surprising results. Substantial numbers of Pro and Gly residues are, however, found in helices and strands. Thus, simple observation of a Pro or Gly in a sequence is not the basis for a reliable prediction of a break in secondary structure at that position. This observation is also not new (Wilmot and Thornton 1988).


Table 4-1. Single residue parsing probabilities

                  Parses                   Helices                  Strands
Residue   Prop   Rel     Abs     Prop   Rel     Abs     Prop   Rel     Abs
P         1.49    60   78054     0.64    24   31798     0.72    16   20909
G         1.34    54  121275     0.69    26   58949     0.90    20   45379
D         1.19    48   80806     0.93    35   58884     0.76    17   29406
N         1.19    48   58689     0.90    34   41611     0.81    18   21975
S         1.12    45   76131     0.93    35   59392     0.94    21   35197
H         1.07    43   30179     0.93    35   24738     0.99    22   15720
C         1.04    42   16885     0.82    31   12491     1.26    28   11202
T         1.04    42   67421     0.88    33   53176     1.08    24   39058
K         1.00    40   71155     1.14    43   75140     0.76    17   30098
R         0.95    38   63309     1.12    42   69631     0.85    19   32182
Q         0.92    37   39900     1.20    45   48895     0.85    19   20175
E         0.92    37   72903     1.22    46   91038     0.81    18   35374
W         0.92    37   12355     1.01    38   12818     1.12    25    8225
Y         0.90    36   34484     1.01    38   36820     1.21    27   25816
F         0.87    35   40834     0.96    36   42039     1.30    29   33448
M         0.85    34   22542     1.20    45   29238     0.94    21   13666
A         0.82    33   86249     1.30    49  128474     0.85    19   49082
L         0.80    32   87680     1.22    46  128827     0.99    22   60728
V         0.80    32   72960     0.90    34   77405     1.48    33   74619
I         0.77    31   59769     1.01    38   72291     1.35    30   57683

Data given for parses, helices, and strands. For each category, “Prop” is the secondary structure propensity (Chou and Fasman, 1978), defined as the probability that a given residue lies in a particular structure divided by the probability that any residue lies in this secondary structure. “Rel” is the reliability of the assignment, the probability (in %) that the indicated amino acid will be found in the indicated secondary structure. “Abs” indicates the number of times the indicated residue is found in the indicated type of secondary structure in the database examined.
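The three quantities defined in this note can be computed from raw counts as in the following Python sketch. It is offered as an illustration under stated assumptions rather than as the code used to build Table 4-1: the function name, the nested-dictionary input, and the toy counts are all hypothetical.

```python
def propensity_table(counts):
    """Compute (Prop, Rel, Abs) per residue and state from raw counts.

    counts: {residue: {state: number_of_occurrences}} (assumed layout)
    Returns {residue: {state: (prop, rel_percent, abs_count)}}.
    """
    states = sorted({s for per_res in counts.values() for s in per_res})
    grand_total = sum(sum(per_res.values()) for per_res in counts.values())
    state_totals = {
        s: sum(per_res.get(s, 0) for per_res in counts.values()) for s in states
    }
    table = {}
    for residue, per_res in counts.items():
        residue_total = sum(per_res.values())
        row = {}
        for s in states:
            n = per_res.get(s, 0)
            rel = n / residue_total                     # P(state | residue)
            background = state_totals[s] / grand_total  # P(state), any residue
            row[s] = (rel / background, 100.0 * rel, n)
        table[residue] = row
    return table


# Toy counts (not taken from the database) for two residue types.
toy = {
    "P": {"parse": 60, "helix": 24, "strand": 16},
    "A": {"parse": 40, "helix": 50, "strand": 10},
}
prop, rel, n = propensity_table(toy)["P"]["parse"]
print(round(prop, 2), round(rel, 1), n)
```

A propensity above 1 means the residue is found in that state more often than an average residue, which is how the Pro and Gly rows of Table 4-1 should be read.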

Dipeptide  Parses 

We then asked whether dipeptides that place “structure disrupting” amino acids adjacent in the sequence could be used to identify parses in the database of known structures. Table 4-2 shows the dipeptides that are the most reliable indicators of parses, listed in order of decreasing reliability as a parse indicator.

The thirteen most reliable dipeptides as parse indicators all contained Pro. However, the ability of a dipeptide containing Pro to indicate a parse depended strongly on the adjacent amino acid. In general, a Pro adjacent to a hydrophobic amino acid was less likely to indicate a parse than a Pro adjacent to a hydrophilic amino acid. Further, Pro was extremely reliable as a parse indicator when adjacent to another structure disrupting amino acid (Pro, Gly, Asn, or Asp). Likewise, the most reliable dipeptides as parse indicators not containing Pro are GC, GN, DG, NG, and GG. Interestingly, Asp and Gly appeared adjacent to Pro with significantly higher frequency than expected from the composition of the database as a whole. This suggests that natural selection favors proteins containing these dipeptides, presumably for structural reasons.

In general, the composition of the dipeptide, not the order of amino acids within it, is the determining factor in the ability of a dipeptide to signal a parse. In only six cases (X = Ala, His, Trp, Gln, Glu, and Tyr) were the sequences PX and XP very different in their ability to indicate a parse. In five out of six cases, with Tyr being the exception, the XP dipeptide has the higher probability as a parse indicator. For Trp, the number of occurrences is small, making the difference not statistically significant.




Table 4-2. Frequencies of dipeptides in parses, helices and strands

                    Parses                    Helices                   Strands
Dipeptide  Prop    Rel     Abs     Prop    Rel     Abs     Prop    Rel     Abs
PP        1.631    81    4238     0.395    13     668     0.402     7     342
CP        1.611    80    1896     0.212     7     175     0.746    13     306
DP        1.591    79    6171     0.395    13    1045     0.459     8     588
WP        1.591    79     946     0.425    14     162     0.402     7      86
NP        1.571    78    5420     0.395    13     916     0.459     8     575
HP        1.571    78    3010     0.364    12     448     0.631    11     410
PG        1.490    74    7627     0.455    15    1560     0.631    11    1099
PD        1.470    73    6122     0.577    19    1606     0.402     7     618
PC        1.450    72    1038     0.364    12     178     0.918    16     230
SP        1.450    72    5206     0.486    16    1141     0.689    12     844
QP        1.430    71    2981     0.546    18     771     0.631    11     446
GP        1.430    71    5154     0.395    13     956     0.918    16    1199
RP        1.430    71    4148     0.516    17     992     0.689    12     721
GC        1.410    70    2353     0.577    19     633     0.631    11     388
TP        1.410    70    5769     0.516    17    1449     0.746    13    1065
GN        1.410    70    5511     0.516    17    1348     0.746    13    1007
AP        1.410    70    6083     0.577    19    1645     0.631    11    1004
KP        1.410    70    4971     0.516    17    1233     0.689    12     882
PK        1.410    70    4325     0.698    23    1388     0.402     7     422
EP        1.390    69    3984     0.637    21    1233     0.574    10     589
DG        1.390    69    8524     0.516    17    2087     0.803    14    1737
PN        1.390    69    3590     0.607    20    1034     0.631    11     590
PS        1.390    69    5240     0.577    19    1457     0.689    12     942
PR        1.369    68    3757     0.728    24    1324     0.459     8     417
NG        1.369    68    6169     0.607    20    1784     0.689    12    1066
GG        1.369    68   11118     0.455    15    2485     0.975    17    2770
PT        1.369    68    4920     0.637    21    1500     0.631    11     770
PF        1.369    68    3335     0.668    22    1087     0.516     9     458
GD        1.349    67    8045     0.577    19    2303     0.803    14    1616
LP        1.349    67    8928     0.668    22    2936     0.631    11    1524
HG        1.349    67    3741     0.668    22    1216     0.631    11     587
PY        1.349    67    3139     0.698    23    1088     0.574    10     479
SG        1.349    67    9675     0.607    20    2835     0.803    14    2000
FP        1.349    67    3519     0.607    20    1054     0.746    13     712
MP        1.329    66    1862     0.455    15     433     1.090    19     538
PW        1.329    66     918     0.759    25     350     0.516     9     122
CC        1.329    66     605     0.395    13     117     1.262    22     200
NC        1.329    66    1150     0.668    22     391     0.689    12     202
GK        1.329    66    8944     0.668    22    2937     0.689    12    1689
TG        1.329    66    8154     0.637    21    2555     0.746    13    1646



Table 4-2. (Continued)

                    Parses                    Helices                   Strands
Dipeptide  Prop    Rel     Abs     Prop    Rel     Abs     Prop    Rel     Abs
RG        1.329    66    7378     0.668    22    2408     0.689    12    1390
KG        1.329    66    7624     0.728    24    2764     0.631    11    1232
PQ        1.329    66    3133     0.759    25    1164     0.516     9     431
PH        1.329    66    2233     0.637    21     704     0.803    14     471
NN        1.329    66    3333     0.759    25    1275     0.459     8     418
GT        1.309    65    7745     0.607    20    2408     0.861    15    1804
CT        1.309    65    1673     0.486    16     404     1.148    20     505
DN        1.309    65    3870     0.789    26    1570     0.516     9     520
GQ        1.309    65    4853     0.728    24    1756     0.689    12     863
YP        1.289    64    2645     0.607    20     825     0.918    16     682
SD        1.289    64    5919     0.759    25    2251     0.631    11    1007
PE        1.289    64    6522     0.850    28    2871     0.459     8     776
EG        1.289    64    7290     0.728    24    2757     0.689    12    1312
PL        1.289    64    7039     0.789    26    2855     0.574    10    1136
GS        1.289    64    8432     0.637    21    2691     0.861    15    1950
QG        1.289    64    4454     0.789    26    1803     0.574    10     716
PA        1.269    63    6040     0.759    25    2430     0.631    11    1079
CG        1.269    63    2303     0.546    18     645     1.148    20     723
PV        1.269    63    6322     0.577    19    1948     1.033    18    1797
IP        1.269    63    5469     0.607    20    1726     0.975    17    1490
SC        1.249    62    1497     0.698    23     554     0.861    15     356
ND        1.249    62    3575     0.820    27    1533     0.631    11     622
PI        1.249    62    4108     0.728    24    1567     0.861    15     965
GE        1.249    62    7955     0.728    24    3048     0.861    15    1929
NS        1.249    62    3977     0.759    25    1609     0.746    13     832
GY        1.249    62    4727     0.668    22    1722     0.918    16    1227
TD        1.249    62    5148     0.759    25    2052     0.803    14    1161
SH        1.228    61    2260     0.820    27     991     0.689    12     456
TN        1.228    61    3873     0.698    23    1441     0.918    16    1028
DD        1.228    61    4909     0.880    29    2349     0.574    10     764
NK        1.228    61    4072     0.941    31    2087     0.459     8     560
KC        1.228    61    1108     0.759    25     447     0.803    14     260
PM        1.228    61    1336     0.698    23     504     0.918    16     355
DW        1.228    61    1331     0.911    30     665     0.459     8     185
VP        1.228    61    6048     0.637    21    2048     1.090    19    1847

Parses are defined as the strings ‘CC’, ‘CE’, ‘EC’, ‘CH’, ‘HC’, ‘EH’, ‘HE’: 1417114 elements.
Helices are defined as the strings ‘HH’: 996159 elements.
Strands are defined as the strings ‘EE’: 493560 elements.

The dipeptides were collected to the point where all dipeptides containing Pro were represented. Propensity is the probability that a given dipeptide will be found in a particular parse divided by the probability that a random dipeptide will be found in this same parse. In this collection of proteins, ca. 50% of the dipeptides are parse positions.
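The string definitions in this note translate directly into a tallying routine. The Python sketch below shows one way to do it, assuming a per-residue three-state string (H, E, C) aligned with the sequence. The function name and the five-residue example are hypothetical, and this is not the original Perl implementation.

```python
from collections import Counter


def dipeptide_state_counts(sequence, states):
    """Tally dipeptides by the structural class of the two residues spanned.

    sequence: one-letter amino acid string
    states:   per-residue three-state string using H, E and C
    'HH' is counted as helix, 'EE' as strand, and every other pair
    ('CC', 'CE', 'EC', 'CH', 'HC', 'EH', 'HE') as a parse.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    counts = Counter()
    for i in range(len(sequence) - 1):
        pair = states[i:i + 2]
        if pair == "HH":
            cls = "helix"
        elif pair == "EE":
            cls = "strand"
        else:
            cls = "parse"
        counts[(sequence[i:i + 2], cls)] += 1
    return counts


# Hypothetical five-residue example.
for key, n in sorted(dipeptide_state_counts("APGKV", "HHCEE").items()):
    print(key, n)
```

Under this definition a dipeptide is a parse whenever it touches a coil residue or straddles a helix-strand boundary, which is why roughly half of all dipeptides in the database fall into the parse class.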




Gap  Parses 

During the divergent evolution of proteins, segments of the polypeptide chain are occasionally inserted or deleted. As observed by Chothia and Lesk (1986) and others (Pascarella and Argos 1992), these segments are preferentially inserted between secondary structural elements. These indels create gaps in a pairwise alignment. Classically, a gap in a multiple alignment has been assumed to indicate a break in the secondary structure of both proteins. For example, Crawford and coworkers assigned breaks at gaps in their bona fide prediction of the conformation of tryptophan synthase (Crawford et al. 1987).

An empirical analysis of insertion and deletion during divergent evolution found (Gonnet et al. 1992) that the probability of a gap of length L falls off as the inverse of L raised to the 1.7 power. This empirical rule holds for gaps ranging from 1 to over 60 amino acids, and is consistent with three hypotheses: (a) that insertions and deletions insert or extract sequences whose ends are near in space in the folded form of the protein, (b) that the inserted or deleted segment forms a random coil, and (c) that the rules governing the conformation of random coils free in solution are the same as those governing the conformation of random coils embedded in folded proteins. While these hypotheses remain speculative, they provide a theoretical context in which to use gaps in a multiple alignment to assign breaks in standard secondary structural elements.
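As a worked illustration of the empirical gap-length rule, read here as a power law with exponent 1.7 (the proportionality constant is not given in the text and is left unspecified), the relative frequencies of gaps of different lengths follow directly:

```latex
P(L) \propto L^{-1.7}
\quad\Longrightarrow\quad
\frac{P(2)}{P(1)} = 2^{-1.7} \approx 0.31,
\qquad
\frac{P(10)}{P(1)} = 10^{-1.7} \approx 0.020 .
```

Under this reading, a gap of ten residues would be expected roughly fifty times less often than a gap of one residue.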

To establish the value of a parsing heuristic based on the identification of gaps, we next systematically examined the probability that a gap indicated a break in secondary structure. Regions in the target sequence were assigned as coils (neither alpha helix nor beta strand) when they were paired against a gap in at least one other aligned homologous sequence within the designated PAM window.
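The gap heuristic just described can be sketched as follows. This is a minimal Python illustration, not the implementation used here. It assumes the homolog rows have already been restricted to the designated PAM window, that all rows of the alignment have equal length, and that "-" marks a gap. The function name and the toy alignment are hypothetical.

```python
def gap_parse_positions(target_row, homolog_rows, gap="-"):
    """Return 0-based target residue indices that face a gap in a homolog.

    target_row:   aligned target sequence (may itself contain gaps)
    homolog_rows: aligned homolog sequences, assumed pre-filtered by PAM
    """
    if any(len(row) != len(target_row) for row in homolog_rows):
        raise ValueError("all alignment rows must have the same length")
    positions = []
    residue_index = -1
    for col, target_char in enumerate(target_row):
        if target_char == gap:
            continue  # no target residue in this column
        residue_index += 1
        if any(row[col] == gap for row in homolog_rows):
            positions.append(residue_index)
    return positions


# Toy alignment: target residues 4, 5 and 6 face a gap in some homolog.
target = "MKT-AYIAK"
homologs = ["MKTGAY--K", "MRTGA-IAK"]
print(gap_parse_positions(target, homologs))
```

The sketch deliberately ignores columns where the target itself is gapped. Whether the residues flanking such an insertion should also be marked as coil is a design choice that the text does not specify.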




Tables 4-3 through 4-5 show the number of amino acid residues in the target sequence correctly and incorrectly assigned as coils using the gap heuristic at varying PAM distances. These data should be considered in light of certain sampling biases. First, the PAM distance determines the number and quality of the parsing assignments made using this heuristic. As discussed quantitatively elsewhere (Benner et al., 1993), the probability of a gap rises with PAM distance. Thus, the number of parsing elements assigned by comparing two similar proteins by a gap heuristic is likely to be small.

Second, with increasing PAM distances, the reliability of a parse assigned by a gap decreases dramatically. There are two reasons for this. First, the assumption that homologous proteins have the same conformation begins to break down at higher PAM distances. Second, the quality of the alignment decreases with higher PAM distances.


Table 4-3. Reliability of gaps as an indicator of parsing elements

PAM        Gap            Parses         Helices        Strands
window     positions     Abs     %      Abs     %      Abs     %
0-20              27      20    74        3    11        4    15
20-40             65      45    69        8    12       12    18
40-60            408     229    56       75    18      104    25
60-80            769     372    48      293    38      104    14
80-100          1003     518    52      263    26      222    22
100-120         2602    1291    50      897    34      414    16
120-140         4920    2379    48     1739    35      802    16
140-160         5414    2414    45     1988    37     1012    19
160-180         3044    1363    45     1018    33      663    22
180-200          707     322    46      193    27      192    27

Table 4-3 shows the number of gaps found in the multiple sequence alignments for the respective PAM ranges and the number of times a gap was aligned to the respective secondary structural segment. As we can see, at small PAM distances, the gaps have a higher preference to be aligned to a parse.

Table 4-4 shows the percentages of the gaps that are aligned to secondary structural segments based on the individual amino acids. Of all the sites in the reference structure that are labeled as a parse, Table 4-4 displays the percentage that are matched against a gap. As we can see from Table 4-4, there are very few secondary structural elements that are matched against a gap at low PAM distances.


Table 4-4. Coverage, by residue, of gaps as an indicator of parsing elements

PAM              Parses                 Helices                Strands
window       Res   Found    %       Res   Found    %       Res   Found    %
0-20        1471      20    1       882       3    0       802       4    0
20-40       1946      45    2      1630       8    0       906      12    1
40-60       3024     229    8      2581      75    3      1352     104    8
60-80       5171     372    7      4665     293    6      2442     104    4
80-100      5914     518    9      5388     263    5      3304     222    7
100-120    10035    1291   13      9000     897   10      5513     414    8
120-140    15400    2379   15     14254    1739   12      7923     802   10
140-160    10364    2414   23     10610    1988   19      5762    1012   18
160-180     5427    1363   25      5391    1018   19      2994     663   22
180-200     1384     322   23       940     193   21       901     192   21

Table 4-5 shows the percentages of the gaps that are aligned to secondary structural elements based on the elements themselves. Of all the secondary structural elements in the database, Table 4-5 displays the percentage that are matched against a gap. This table shows clearly that the likelihood of finding a gap aligned with a secondary structural element is much higher at higher PAM distances.


Table 4-5. Coverage, by residue, of gaps as an indicator of parsing elements

PAM              Parses                 Helices                Strands
window       Res   Found    %       Res   Found    %       Res   Found    %
0-20         280      13    5        91       7    8       194       2    1
20-40        394      28    7       176       4    2       227       5    2
40-60        603      77   13       271      30   11       339      36   11
60-80       1050     166   16       479      91   19       584      55    9
80-100      1256     223   18       536      87   16       725      95   13
100-120     2106     492   23       979     258   26      1173     209   18
120-140     3253     943   29      1477     523   35      1829     392   21
140-160     2282     837   37      1088     542   50      1250     406   32
160-180     1183     461   39       552     266   48       643     250   39
180-200      305     114   37       109      53   49       195      72   37



Conclusion 

In  this  chapter  contemporary  data-mining  techniques  were  applied  to  learn  how  the 
sequences  of  natural  proteins  might  be  analyzed  to  predict  “parses,”  breaks  in  standard 
secondary  structural  units  (alpha  helices  and  beta  strands)  in  the  folded  form.  Specific 
strings  of  consecutive  amino  acids  in  a polypeptide  chain  were  identified  that  improved 
assignments  of  parses  based  on  single  residue  analysis.  Further,  specific  patterns  of 
variation  and  conservation  improve  prediction  success  when  using  a multiple  alignment 
as  input. 

Empirical analysis of the data revealed striking properties of the placement of amino acids within secondary structure. For single residues and dipeptides, it was determined that proline preferred coils over helices and strands. Proline is almost always exposed at the surface of globular proteins, where, due to the non-polar character of its ring structure, it presents a hydrophobic spot. Proline does not fit into the regular part of either helix or sheet structures because it does not have a backbone NH available to take part in hydrogen bonding. Proline gives the backbone a special rigidity (fixed φ torsion angle at -60°, Cα-N). In the helix center, the ring pushes away the preceding (N-terminal) turn of the helix by ~1 Å, producing a 30° bend and breaking the next H-bond as well. An example is shown in Figure 4-1.

These results show that, both by observing consecutive strings and by examining multiple sequence alignments, tools can be developed that are useful for predicting breaks in the secondary structure of proteins. Although the method for predicting secondary structure from multiple alignments depends upon the quality of the alignment, we found most of the heuristics presented herein quite tolerant of imperfect alignments. As with all parameterized tools, it is possible that the tool developed here might be biased to reflect a particular dataset. Of course, these results are valid only for the dataset presented herein, and the validation of these rules will come through the making of bona fide secondary structure predictions.


Figure 4-1. A strand of protein in (a) the linear, fully extended conformation, and (b) the same strand with a proline in the middle.


CHAPTER  5 

EVALUATION  OF  CASP  V 

Introduction 

The  Critical  Assessment  of  Techniques  for  Protein  Structure  Prediction  (CASP)  is  a 
project,  funded  by  the  Federal  government,  that  every  second  year  brings  together 
individuals  who  have  developed  tools  to  predict  features  of  protein  conformation  from 
sequence  data.  In  the  project,  often  referred  to  as  a contest,  crystallographers  are  invited 
to  submit  target  proteins  whose  structures  have  been  solved,  or  will  soon  be  solved,  but 
whose  solutions  have  not  yet  been  released  to  the  public.  This  permits  a blind  prediction 
exercise,  where  various  teams  compete  by  submitting  predictions  about  features  of  the 
target  conformation  to  a web  site,  where  they  are  organized  and  evaluated. 

The submitted predictions can include a simple secondary structure map, a set of coordinates predicted for the placement of atoms in the backbone, or a full set of coordinates for all atoms. The standard goals were to identify successful methods for comparative modeling, fold recognition, and ab initio structure prediction. In different years, different CASP contests have included efforts to predict other features of protein folding. For example, in 1996 and 2000, targets were first included where the challenge was to predict where on a protein a ligand would dock.

In the summer of 2002, I participated in the CASP V project. This chapter reviews the work that I did to prepare the submissions and summarizes the outcome.




Evaluation  of  a Secondary  Structure  Prediction 

One  part  of  the  CASP  V project  requires  the  evaluation  of  the  secondary  structure 
predictions  submitted  by  various  contestants.  The  need  to  evaluate  predictions  and 
declare  a winner  raises  the  important  question:  How  does  one  generate  a quantitative 
score  concerning  the  quality  of  a secondary  structure  prediction?  In  the  CASP  project,  as 
well  as  in  the  general  literature,  several  scoring  metrics  are  used.  Each  of  them  requires 
an  “authentic”  secondary  structural  assignment  based  on  the  electron  density  determined 
by  the  crystal  structure  itself. 

One widely cited tool, also used in CASP V, is the Q3 measure. The Q3 gives the percentage of residues predicted correctly as one of three states: helix, strand, and coil. The definition of Q_i for a single conformational state is:

\[ Q_i = \frac{\text{Number of residues correctly predicted in state } i}{\text{Number of residues observed in state } i} \times 100 \tag{5-1} \]

where i is either a helix, strand or coil. The definition of Q3 for all three states is:

\[ Q_3 = \frac{\text{Number of residues correctly predicted}}{\text{Number of all residues}} \times 100 \tag{5-2} \]

This score assumes that the secondary structure is assigned site-by-site. A perfect score is, presumably, 100, and various predictions can be ranked based on this score.
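Equations 5-1 and 5-2 can be evaluated with the short Python sketch below. It is an illustration only and is not the CASP scoring code. The function names, the single-letter state codes (H, E, C), and the example strings are assumptions made for the example.

```python
def q_state(observed, predicted, state):
    """Equation 5-1: percentage of residues observed in `state`
    that were also predicted to be in `state`."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n_observed = sum(1 for o in observed if o == state)
    if n_observed == 0:
        raise ValueError("state does not occur in the observed assignment")
    n_correct = sum(
        1 for o, p in zip(observed, predicted) if o == state and p == state
    )
    return 100.0 * n_correct / n_observed


def q3(observed, predicted):
    """Equation 5-2: percentage of all residues predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    n_correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * n_correct / len(observed)


# Hypothetical ten-residue example (H = helix, E = strand, C = coil).
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(q3(observed, predicted))
print(q_state(observed, predicted, "H"))
```

Because Q3 is computed site by site, a prediction that shifts a helix end by one residue is penalized exactly as much as one that misplaces a residue in the middle of an element.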

Considerable discussion has centered on the usefulness of this metric. The score is subjective in several important ways. First, there is no such thing as an experimental secondary structure. The experimental data produced by X-ray crystallography (or by NMR) are a set of coordinates for atoms in a protein. Secondary structure is an abstraction of these coordinates. Converting the primary experimental data into an assignment of secondary structure requires definitions (when does a segment exist as an “α helix” or a “β strand”?). These definitions are themselves subjective.

To illustrate this, we consider three different ways to define secondary structure in terms of coordinates. In one, secondary structure is defined by the two dihedral angles in the polypeptide backbone that undergo free rotation. The φ and ψ angles of amino acids in natural proteins are conveniently presented on a Ramachandran diagram, as shown in Figure 5-1 (Ramachandran and Sasisekharan 1968). In natural proteins, certain combinations of dihedral angles are more populated than others. Certain regions of the Ramachandran diagram are defined as holding amino acids in α helices, and others as holding amino acids in β strands. Amino acids with dihedral angles lying outside of these regions are defined as coil. Thus, arbitrarily placed regions on the Ramachandran diagram define three states that might be used to score a secondary structure prediction, where the dihedral angles of individual amino acids are extracted from crystallographic coordinates.
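A dihedral-angle definition of this kind amounts to a lookup of (φ, ψ) against fixed regions. The Python sketch below makes that concrete. The rectangular boundaries used are illustrative assumptions chosen for the example and are not the boundaries drawn in Figure 5-1. The function name is likewise hypothetical.

```python
def ramachandran_state(phi, psi):
    """Assign H, E or C from backbone dihedral angles given in degrees.

    The region boundaries below are ASSUMED for illustration only;
    any such boxes are arbitrary, which is the point made in the text.
    """
    # Assumed right-handed alpha-helical box.
    if -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0:
        return "H"
    # Assumed beta-strand box (psi wraps around +/-180 degrees).
    if -180.0 <= phi <= -45.0 and (psi >= 90.0 or psi <= -150.0):
        return "E"
    # Everything else is called coil.
    return "C"


for angles in [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]:
    print(angles, ramachandran_state(*angles))
```

Shifting either box by a few degrees changes the state assigned to residues near its edge, which is one concrete way to see why this definition is subjective.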

This definition of secondary structure is inadequate for evaluating a prediction, however. A single amino acid may have φ and ψ values that place the peptide unit in the middle of the region of the Ramachandran diagram that defines an α helix, but still not have that unit be a part of a helix. An α helix is stabilized by hydrogen bonding between backbone atoms coming from amino acids four positions removed in a chain. In a sheet, the N-H and C=O groups of the backbone participate in hydrogen bonds to C=O and N-H groups in other strands still more distant in the sequence.




Figure 5-1. Ramachandran plot showing the (arbitrary) boundaries between values of φ and ψ that indicate α helices, β strands, and coils (the remainder of the diagram). (Source: http://www.phys.psu.edu/~lezon/prot4.html, last accessed February 20, 2004).

Alternatively, helices and sheets might be defined by the presence of hydrogen bonds. In an α helix, the C=O group of amino acid i forms a hydrogen bond with the NH group of amino acid i+4. In an antiparallel β strand, amino acids i and i+2 form hydrogen bonds with amino acids j and j+2. This hydrogen bonding network is a more powerful tool for assigning secondary structure than is a simple position on a Ramachandran plot.

Indeed, the tool permits a more comprehensive description of secondary structural types, including 3₁₀ helices, π helices, and various types of bends and turns. A standard nomenclature permits these to be identified by a careful analysis of hydrogen bonding patterns (Kabsch and Sander 1983).
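A hydrogen-bond-based assignment can be illustrated with a much simplified version of the helix rule of Kabsch and Sander (1983). In that scheme, an n-turn exists at residue i when the C=O of residue i is bonded to the N-H of residue i+n, and two consecutive n-turns define a minimal helix. The Python sketch below assumes that the list of backbone hydrogen bonds has already been derived from the coordinates. The function name and the toy bond list are hypothetical, and the sketch omits everything else DSSP does (energy cutoffs, strands, bends, and turns).

```python
def helix_from_hbonds(hbonds, n_residues, n=4):
    """Mark helical residues from backbone hydrogen bonds.

    hbonds: set of (i, j) pairs meaning C=O of residue i is bonded
            to N-H of residue j (0-based indices, assumed precomputed)
    n:      4 for alpha helices, 3 for 3-10 helices, 5 for pi helices
    Two consecutive n-turns at i-1 and i make residues i..i+n-1 helical.
    """
    turn = [(i, i + n) in hbonds for i in range(n_residues)]
    helix = [False] * n_residues
    for i in range(1, n_residues):
        if turn[i - 1] and turn[i]:
            for k in range(i, min(i + n, n_residues)):
                helix[k] = True
    return "".join("H" if h else "-" for h in helix)


# Toy example: three consecutive i -> i+4 bonds in a 12-residue chain.
bonds = {(1, 5), (2, 6), (3, 7)}
print(helix_from_hbonds(bonds, 12))
```

Changing `n` from 4 to 3 or 5 gives the corresponding rule for the other helix types, which is how one bonding criterion yields several secondary structural types.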

Crystal structures of proteins generally do not have the resolution needed to see hydrogens, however, meaning that the positions of hydrogens and hydrogen bonding patterns must be inferred from the positions of heavy atoms. Further, the dynamic behavior of protein structures, together with the occurrence of distorted secondary structural elements, means that not all helices and strands evident to a human eye inspecting a crystal structure are identified using programs that search for hydrogen bonding.

A third way to define secondary structure relies on the relative orientation of the side chains in a polypeptide chain. In an α helix, the side chain of an amino acid protrudes from a cylinder approximately 1.5 Å along the helix axis, and ~100° around the helix axis, relative to the side chain of the amino acid preceding it in the chain. This relationship is graphically described by a Schiffer-Edmundson helical wheel (Schiffer and Edmundson 1967), which is a projection of a helix down its long axis to view the relative disposition in space of the amino acid side chains. The side chains in a β strand alternate above and below the sheet. As the side chains of all amino acids (except, of course, glycine) contain heavy atoms, the relative orientation of side chains is easily seen in crystal structures with satisfactory resolution.
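The helical wheel geometry quoted above (about 100° of rotation and 1.5 Å of rise per residue) can be tabulated with the Python sketch below. The function name and the seven-residue peptide are hypothetical and serve only to illustrate the projection.

```python
def helical_wheel(sequence, angle_per_residue=100.0, rise_per_residue=1.5):
    """Return (residue, angle in degrees, rise in angstroms) per position.

    The angle is the position of the side chain around the helix axis,
    measured relative to the first residue and wrapped to [0, 360).
    """
    return [
        (aa, (k * angle_per_residue) % 360.0, k * rise_per_residue)
        for k, aa in enumerate(sequence)
    ]


# Hypothetical seven-residue peptide.
for residue, angle, rise in helical_wheel("LKELLKK"):
    print(residue, angle, rise)
```

With 100° per residue, one full turn takes 3.6 residues and advances 5.4 Å along the axis, so residues i, i+3, i+4, and i+7 cluster on the same face of the helix.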

As secondary structure is an abstraction of experimental data, no one of these definitions is more correct than any other. What is clear, however, is that the different definitions need not yield the same experimental secondary structure assignments from the same set of experimental coordinates (Fasman 1989). The subjective nature of experimental secondary structure assignments was quantitated by Colloc'h et al. (1993), who compared three automated tools (DSSP (Kabsch and Sander 1983), P-curve (Sklenar et al. 1989) and Define (Richards and Kundrot 1988)) that assign secondary structure to crystallographic data. The P-curve program identifies regularities along the helicoidal axis in a polypeptide in assigning secondary structure, DSSP considers hydrogen-bonding patterns, while Define measures distances between Cα atoms. Colloc'h et al. (1993) asked what percentage of the residues in the protein received the same secondary structural assignment by all three methods applied to the very same coordinate data. The answer was a strikingly low 63% (Colloc'h et al. 1993). This number is especially relevant considering that current secondary structure prediction heuristics are routinely yielding three-state Q3 scores of approximately 70%.
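The kind of comparison reported by Colloc'h et al. (1993) reduces to counting the positions at which three assignment strings agree. The Python sketch below illustrates that count. The three strings are invented for the example and are not output from DSSP, P-curve, or Define. The function name is also hypothetical.

```python
def three_way_agreement(a, b, c):
    """Percentage of positions given the same state by all three methods."""
    if not (len(a) == len(b) == len(c)) or not a:
        raise ValueError("need three non-empty strings of equal length")
    same = sum(1 for x, y, z in zip(a, b, c) if x == y == z)
    return 100.0 * same / len(a)


# Invented assignments for a ten-residue segment.
method_1 = "HHHHCCEEEC"
method_2 = "HHHCCCEEEC"
method_3 = "HHHHCCCEEC"
print(three_way_agreement(method_1, method_2, method_3))
```

In this toy case the disagreements fall at the ends of elements, which is where different definitions of the same helix or strand would be expected to differ.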

The study by Colloc'h et al. (1993) makes a general point: One cannot score a secondary structure prediction objectively if the experimental secondary structure that serves as the reference is subjective. At the very least, the subjectivity in assigning secondary structure to crystallographic data sets an upper limit on the Q3 scores that a prediction can have. Thus, the score of an effectively perfect prediction may be as low as 63% plus one third of 37% (about 75%), if it identifies correctly the secondary structures at the sites where all of the assignment tools agree, and randomly assigns secondary structure at the remaining sites.
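The arithmetic behind this estimate can be written out explicitly. The worked figure assumes, as the argument above does, that a random assignment among three states is correct one time in three at the sites where the assignment tools disagree:

```latex
Q_3 \approx 63 + \tfrac{1}{3}\,(100 - 63) = 63 + 12.3 \approx 75.3
```

On these assumptions, a prediction that is right wherever the reference is unambiguous would still score only about 75.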

In practice, the lack of objectivity associated with defining secondary structure from experimental coordinates alone makes it impossible for the residue-by-residue score of a secondary structure assignment to be routinely higher than ~75-85% (Rost et al. 1994). Higher scores obtained by predictions judged against an experimental assignment generated by one method imply lower scores when judged against scoring obtained by another.

Another problem arises with a consensus prediction. The method that we use assumes that all homologous proteins have analogous folds, implying that homologous regions in each protein homolog, defined by the multiple sequence alignment, must have the same secondary structure.

This  need  not  be  true  as  a matter  of  practice.  During  sequence  divergence, 
secondary  structural  elements  may  be  gained  or  lost  in  the  protein  fold.  In  these  cases, 
homologous  proteins  need  not  have  the  same  secondary  structure.  In  practice,  tools  that 
build  conformational  models  from  aligned  sets  of  sequences  from  homologous  proteins 
recognize  these  regions  (they  are  frequently  gapped),  and  assign  no  core  secondary 
structural  element  to  these  regions.  This  means  that  when  a structure  assignment  emerges 
from  a specific  member  of  the  protein  family,  the  elements  that  can  be  predicted  are  only 
those  that  are  conserved  in  all  of  the  homologous  folds. 

When  domains  are  gained  and  lost,  the  difference  between  a predicted  structure, 
which  considers  only  core  elements,  and  an  experimental  structure,  which  includes  all 
elements, can be severe. For example, in phospho-β-galactosidase, a prediction target
included an entire domain that was not present in the homologs (Gerloff and Benner
1995). This domain was recognized as being non-core by the predictors. Likewise, the
predictors identified 100% of the core secondary structural elements correctly, and even
built a correct tertiary structure model from them. Because the Q3 score was calculated
with reference to the secondary structural elements in one specific member of the protein
family, however, it was only 60%.

A second tool considers a segment-by-segment score. Here, individual elements of
secondary  structure  are  counted.  The  element  is  said  to  have  been  correctly  predicted  if, 
for  example,  the  prediction  assigns  the  correct  secondary  structure  to  half  of  the  sites  in 
the experimentally determined element. This choice is also subjective, for the reasons
outlined above. The metric does not, for example, consider whether the entire element is
conserved within the family.

The  scoring  scheme  also  has  elements  of  arbitrariness.  Various  segment  scoring 
metrics  count  the  prediction  of  a helix  or  strand  as  correct  if  and  only  if  more  than  half  of 
the  sites  are  assigned  as  helix  or  strand,  and  none  of  the  remaining  sites  are  assigned  as 
strand or helix. Some methods score a helix even if it has fewer than four amino acids,
although a helix cannot be this short.
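
The segment rule just described can be sketched in a few lines of Python. This is an illustration only, not the scoring code used by any particular assessment: the function names and the example strings are hypothetical, and the states are assumed to be encoded as H (helix), E (strand), and C (coil).

```python
from itertools import groupby

def elements(states):
    """Return (kind, start, end) for each helix (H) or strand (E) run."""
    found, pos = [], 0
    for kind, run in groupby(states):
        length = len(list(run))
        if kind in ("H", "E"):
            found.append((kind, pos, pos + length))
        pos += length
    return found

def element_correct(kind, start, end, predicted):
    """More than half the sites match, and no site carries the opposite type."""
    window = predicted[start:end]
    opposite = "E" if kind == "H" else "H"
    return 2 * window.count(kind) > len(window) and opposite not in window

def segment_score(observed, predicted):
    """Count observed elements judged correct; returns (correct, total)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    found = elements(observed)
    correct = sum(element_correct(k, s, e, predicted) for k, s, e in found)
    return correct, len(found)

# Hypothetical three-state strings, not data from any CASP target.
observed = "CC" + "H" * 8 + "CC" + "E" * 5 + "CC" + "H" * 6 + "CC"
predicted = "CCC" + "H" * 6 + "CCC" + "EEHHE" + "CCCC" + "HHH" + "CCC"
correct, total = segment_score(observed, predicted)
print(f"{correct} of {total} elements correct")
```

In this toy case the strand fails because two of its sites are predicted as helix, and the second helix fails because exactly half, not more than half, of its sites are predicted as helix. Both outcomes show how sensitive the count is to the arbitrary thresholds.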

A third  tool  considers  the  type  of  accuracy  that  is  needed  to  make  the  prediction 
useful  to  solve  biological  problems.  Here,  it  is  recognized  that  the  primary  use  of  a 
prediction  is  to  detect  distant  homology  between  proteins  whose  sequence  similarity  is 
low,  but  nonetheless  suggestive  of  common  ancestry.  Here,  the  tool  counts  the  number 
of  helical  elements  that  are  mispredicted  as  strands,  and  the  number  of  strand  elements 
that  are  mispredicted  as  helices.  Further,  it  distinguishes  between  core  and  non-core 
secondary  structural  elements,  recognizing  that  a misprediction  of  the  first  type  is  more 
likely  to  confound  the  application  of  the  prediction  than  a misprediction  of  the  second 
type. 

CASP V

For CASP V, concluding in the winter of 2002, I submitted predictions for fourteen
ab  initio  targets,  as  a part  of  a collaboration  with  Thomas  McCormack  and  Steven 
Benner.  The  goal  of  this  submission  was  to  explore  the  potential  for  predicting 
secondary  structure  using  a transparent  secondary  structure  prediction  method,  which 
combines  computation  and  manual  analysis  of  the  sequences  of  a set  of  homologous 
proteins  (Benner  and  Gerloff  1991). 


Table  5-1  summarizes  the  sequence  and  structural  information  for  the  CASP  5 
proteins  that  were  predicted  in  the  context  of  existing  folds  and  evolutionary 
relationships.  The  targets  were  selected  based  on  the  availability  of  homologous  protein 
sequences  in  adequate  numbers  (typically  no  fewer  than  5,  with  an  evolutionary  distance 
separating the members by at least 100 PAM units, and an even distribution over the
tree).  The  MasterCatalog,  a commercial  naturally  organized  database  developed  by 
EraGen  Biosciences  (Madison,  WI),  was  used  as  the  tool  to  browse  these  families. 

Multiple  alignments  were  generated  from  families  extracted  from  the 
MasterCatalog  using  the  automated  DARWIN-server  (Gonnet  et  al.  1992).  Secondary 
structures  were  predicted  based  on  automated  heuristics  to  assign  surface,  interior,  active 
site  and  parsing  residues  by  analysis  of  patterns  of  conservation  and  variation  among 
homologous  protein  sequences  in  light  of  evolutionary  models  that  interpret  amino  acid 
substitutions  as  the  consequence  of  neutral  variation  subjected  to  functional  constraints 
(Benner  et  al.  1994). 

In  some  categories,  a search  of  the  MasterCatalog  identified  a homologous  protein 
whose  crystal  structure  was  already  solved.  These  were  discarded,  as  they  are  not  ab 
initio  candidates.  A list  of  the  targets  is  shown  in  Table  5-1  and  was  adapted  from  a 
special  issue  of  PROTEINS:  Structure,  Function,  and  Genetics  dedicated  to  CASP  5 
(Kinch  et  al.  2003). 

The  scores  of  my  predictions  are  collected  in  Table  5-2,  together  with  data 
concerning  the  target  and  the  evolutionary  family  to  which  it  belongs.  These  scores  were 
provided  by  the  CASP  automated  scoring  tool,  and  obtained  from  CASP  V (Secondary 
and  3D  Protein  Structure  Prediction  Evaluation  Info,  Last  accessed  February  12,  2004). 


Table 5-1. Summary of Prediction Targets for the CASP V ab initio Project

Target  Length  Name                                      Species          PDB ID      Method
T0129   182     HI0817                                    H. influenzae    1izm        X-ray
T0131   100     HI0857                                    H. influenzae    -           X-ray
T0133   312     HIP1R N-terminal domain                   rat              -           X-ray
T0146   325     ygfz                                      E. coli          -           X-ray
T0148   163     HI1034                                    H. influenzae    1in0        X-ray
T0149   318     yjiA                                      E. coli          1nij        X-ray
T0156   157     Probable SAM-dependent methyltransferase  M. tuberculosis  -           X-ray
T0157   138     yqgF                                      E. coli          -           X-ray
T0162   286     F-actin capping protein α-1 subunit       chicken          1izn        X-ray
T0165   318     Cephalosporin C deacetylase               B. subtilis      1l7a        X-ray
T0168   327     Glutaminase                               B. subtilis      1mki        X-ray
T0172   299     Conserved hypothetical protein MRAW       T. maritima      1m6y, 1n2x  X-ray
T0187   417     TM1585                                    T. maritima      1o0u        X-ray

Description (post-constant), by target:

T0129   Novel fold composed of seven helices. First four helices form a distorted
        up-and-down bundle, whereas the rest assemble as a 3-helical left-handed bundle.
        Helix 2 and helix 5 form a tight interaction.
T0131   Preliminary version of the structure. No side-chain assignments. Not used for
        assessment.
T0133   α/α superhelix fold, same family as N-terminal domain of
        phosphoinositide-binding clathrin adaptor (13% 1hg5).
T0146   Three domains: Domains 1 and 2 in tight contact and represent a potential
        duplication of a ferredoxin-like unit with an additional β-strand inserted at the
        edge of the β-sheet. Domain 1 is circularly permuted with respect to domain 2. No
        side-chain assignments for domain 3. Domains 2 and 3 connected by a
        sequence-conserved linker.
T0148   Tandem repeat of a ferredoxin-like fold with swapped N-terminal strands. Each
        domain is a possible remote homologue of Ribosome recycling factor α+β domain
        (11%, 15% 1ek8).
T0149   Two domains: N-terminal Nitrogenase iron protein family (17% 1j8m). C-terminal
        novel fold with some structural similarity to 1ah6, 1egf_B: 149-337, 1ptf and
        target T0187_1.
T0156   Phosphohistidine domain superfamily, the "swiveling" domain fold (14% 1dik, Dali
        z score 7.5). Homology inferred from structural similarity.
T0157   Ribonuclease H-like superfamily (15% 1hjr, Dali z score 9.6). Homology inferred
        from the presence of described RNaseH motifs.
T0162   Three domains: N-terminal three-helical bundle, middle possible rubredoxin-like
        zinc finger that lost zinc ligands (9% 1rfs), C-terminal novel fold
        five-stranded meander flanked by two helices.
T0165   α/β-hydrolase superfamily (18% 1a8s).
T0168   Domain structure shared by β-Lactamase/D-ala carboxypeptidase superfamily (7%
        1fof). Found by transitive PSI-BLAST.
T0172   SAM-dependent methyltransferase (16% 1jg2) with an inserted SAM-like fold domain
        (15% 1cuk).
T0187   Two domains: N-terminal structurally similar to Cobalt precorrin-4
        methyltransferase C-terminal domain (8% 1cbf), C-terminal Rossmann-type fold
        (11% 1gpj). T0187_1 shares some topological similarity with T0149_2.


Table  5-3  shows  the  scores  obtained  by  David  Baker,  a predictor  who  was  ranked 
comparably  or  better.  The  results  shown  in  this  table  represent  the  top  models  submitted 
by Dr. Baker. In the CASP 5 experiment, predictors were allowed to rank their
respective submissions for each target. To allow comparison with Baker's results, a
sequence alignment, my secondary structure prediction, and Baker's secondary structure
prediction are presented for each target in the respective Results section.


Table 5-2. Summary of Results for the CASP V ab initio Project

Target    Q3    Qhelix  Qstrand  Qcoil  SOV   Length  Number of Members in the Family
T0129_1   61.8  57.3    0.0      70.0   64.0  182     8
T0129_2   66.5  64.5    0.0      70.0   66.1  182     8
T0131     Not publicly available as of yet
T0133_1   57.7  53.1    50.0     69.0   56.1  312     11
T0133_2   54.6  49.3    50.0     67.9   53.7  213     11
T0146     55.5  23.0    57.4     66.1   40.1  325     12
T0148     69.1  73.8    54.2     77.6   69.3  163     15
T0149     64.7  54.7    56.4     78.1   61.0  318     78
T0156     70.5  100.0   65.5     62.5   66.9  157     42
T0157     62.5  63.4    62.5     61.5   62.3  138     69
T0162     55.3  48.1    47.5     70.3   59.9  286     21
T0165     64.8  48.0    61.5     82.3   65.5  318     13
T0168     53.4  30.5    50.0     84.6   53.5  327     32
T0172     71.3  80.4    37.0     74.3   68.3  299     11
T0187     73.1  64.1    66.2     85.2   72.3  417     27

Table 5-3. Summary of Baker's results for the CASP V ab initio Project

Target    Q3    Qhelix  Qstrand  Qcoil  SOV   Length
T0129     86.5  90.9    0.0      78.3   91.9  182
T0131     Not publicly available as of yet
T0133     76.5  82.1    0.0      64.3   69.6  312
T0146     70.6  86.9    50.8     71.8   50.5  325
T0148     77.2  87.7    62.5     77.6   73.5  163
T0149     61.3  76.5    45.1     58.8   59.0  318
T0156     51.3  79.3    18.2     65.3   47.4  157
T0157     84.2  95.1    72.5     84.6   81.6  138
T0162     69.8  89.4    56.2     59.3   54.3  286
T0165     80.2  79.0    75.0     84.5   82.2  318
T0168     78.8  85.1    55.0     78.6   77.7  327
T0172     77.1  85.0    72.7     72.1   84.0  299
T0187     72.2  91.1    42.3     62.0   64.4  417

General  Trends 

From theory, one expects that the quality of a secondary structure prediction based on a
multiple sequence alignment will improve as the number of independent sequences in the
alignment increases. Figure 5-2 shows a plot of the various scores versus the number of
proteins in the family. There appears to be no correlation between the number of
proteins in the family and the Q3 score.
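
The visual impression can be checked numerically. The sketch below computes a Pearson correlation coefficient between family size and Q3, using the values listed in Table 5-2 (T0131 is omitted because no score was available). Treating the two models submitted for T0129 and for T0133 as separate data points is an assumption of this sketch, and the coefficient is offered only as a rough check, not as an analysis performed in this work.

```python
import math

# (target, Q3, number of family members), transcribed from Table 5-2.
rows = [
    ("T0129_1", 61.8, 8), ("T0129_2", 66.5, 8),
    ("T0133_1", 57.7, 11), ("T0133_2", 54.6, 11),
    ("T0146", 55.5, 12), ("T0148", 69.1, 15),
    ("T0149", 64.7, 78), ("T0156", 70.5, 42),
    ("T0157", 62.5, 69), ("T0162", 55.3, 21),
    ("T0165", 64.8, 13), ("T0168", 53.4, 32),
    ("T0172", 71.3, 11), ("T0187", 73.1, 27),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

family_sizes = [size for _, _, size in rows]
q3_scores = [q3 for _, q3, _ in rows]
r = pearson(family_sizes, q3_scores)
print(f"Pearson r = {r:.2f} over {len(rows)} predictions")
```

A coefficient near zero is consistent with the absence of any clear trend, although fourteen points are too few to exclude a weak relationship.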


Figure 5-2. Plot of the Q3 scores versus the number of proteins in the family.

This  number  need  not  be  a good  metric  of  the  information  contained  in  a set  of 
homologous  sequences.  Thus,  if  the  set  contains  many  sequences  that  are  nearly 
identical,  then  a large  set  need  not  necessarily  generate  a prediction  that  is  better  than  a 
smaller set having larger sequence diversity. The analogy has been made that one does
not get more information by having multiple copies of the New York Times all printed on
the same day.


Second, the theory suggests that if the multiple sequence alignment is not of high
quality, then the prediction will not be of high quality either. In the prediction
exercise, we frequently discarded individual sequences, and sometimes even entire
families of sequences, if including them would make the multiple sequence alignment poor
in quality. Sometimes, two families were examined separately, and a joint prediction was
assembled from the two separate predictions. Clearly, it is difficult to place
quantitative metrics on the overall sequence diversity within a family. This is possible,
however, if one determines the number of changes in the family per site, to give a metric
of the overall divergence. This would require putting the protein sequences into
MacClade, a computer program for phylogenetic analysis, and reading the number of
changes at the level of the amino acid sequence in the tree. This was not done in this
work.

My predictions were poorer than those of Baker, who used comparative modeling.
Dr. Baker's approach employs Rosetta (Simons et al. 1997), a computational
fragment-insertion method. The procedure comprises up to five steps: A) detection of the
best parent for each putative domain, B) sequence alignment to that parent, C) modeling
of structurally variable regions, D) optimization to increase the physical
reasonableness of the final model, and E) reassembly of the complete chain when domains
were parsed and processed individually.

These  scores  approach  the  maximum  expected  theoretically  for  a prediction  based 
on  a multiple  sequence  alignment.  It  was  clear  that  some  elements  were  predicted 
incorrectly.  As  the  automated  scores  are  inadequate  to  provide  information  about  what 
was  done  correctly,  and  what  was  done  incorrectly,  we  examined  the  predictions  in 
greater  detail  to  obtain  clues  about  how  our  prediction  tools  might  be  improved. 


Examination  of  Predictions 

The  goal  of  CASP  is  to  help  advance  the  methods  of  identifying  protein  structure 
from  sequence.  For  this  reason,  the  predictions  prepared  for  CASP  are  those  made  and 
announced  before  experimental  knowledge  of  a structure  is  available.  In  this  way,  CASP 
provides  a means  of  objective  testing  of  these  methods  via  the  process  of  bona  fide 
predictions.  We  attempt  to  review  every  example  of  a transparent  bona  fide  prediction 
based  on  evolutionary  analyses  that  does  not  rely  on  the  identification  of  a homolog 
whose structure is already solved experimentally. Color diagrams illustrate helical
wheels for each helix. The residues are color coded for hydrophobic (red),
hydrophilic  (blue)  and  conserved  (green)  amino  acid  types.  The  wheels  assume  the  ideal 
helical  value  of  3.6  residues  per  turn.  The  hypothesis  that  we  explored  in  this  activity 
relates  to  the  fact  that  our  strategy  for  predicting  secondary  structural  elements  commonly 
misses  six  types  of  secondary  structural  elements. 

Some  of  these  are  related  to  the  fact  that  a prediction  must  be  compared  with  an 
experimental  secondary  structural  assignment  that  places  helices  and  strands  within  a 
protein  sequence  based  on  an  analysis  of  the  position  of  atoms  as  represented  by  the 
electron density in the crystal structure. Some of these are related to difficulties in
the method itself.

We now discuss these by class. The first potential error of the first class arises when
the experimental assignment of the secondary structure is itself in error. In this case,
it will be impossible to achieve a perfect score even if the secondary structure is
predicted correctly. Another error associated with this class is the identification of
short helices and strands. The experimental assignment may classify a helix of length
three, for example, even though helices cannot be this short. Such an element will not
be predicted by this method.

There  are  four  examples  of  the  second  class  of  error,  those  that  are  related  to  the 
method.  First,  if  the  experimental  assignment  is  correct,  the  segment  may  be  aligned  to  a 
portion  of  the  multiple  sequence  alignment  that  is  highly  gapped.  In  these  regions, 
termed  “non-core,”  it  is  very  difficult  to  extract  useful  information  from  the  aligned 
sequences. Evolution-based strategies require an accurate multiple sequence alignment in
order  to  identify  positions  of  insertions  and  deletions,  conserved  residues  and 
hydrophobic  conservation  patterns.  Second,  an  erroneous  prediction  may  be  made  if  the 
secondary  structural  element  is  not  conserved.  In  this  method  we  are  creating  a 
consensus  prediction  of  secondary  structure,  and  if  this  unit  is  not  found  throughout  the 
family,  we  will  not  detect  it.  Third,  secondary  structural  elements  near  an  active  site  are 
difficult  to  detect.  Typically,  these  elements  are  highly  conserved  and  do  not  display  any 
variation.  Finally,  structural  elements  that  lie  on  the  surface  or  entirely  within  the  fold  of 
a globular  protein  are  more  difficult  to  assign. 

T0129

Target T0129 was submitted by an unnamed crystallographer. The structure has been
published; the Protein Data Bank (PDB) identifier for this structure is 1izm. The target
protein, designated HI0817, is from Haemophilus influenzae and is 182 amino acids in
length. The MasterCatalog yielded a total of 14 homologous proteins in a preconstructed
family designated MC18669. The full-length sequences from MC18669 were culled for
redundancies, which left eight non-redundant sequences. This small number of sequences
implies that the predicted structure would have low overall reliability.
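
The text does not say how redundancy was judged, so the following Python sketch is only one plausible way to cull a family. It assumes that the sequences are already aligned, that redundancy means pairwise identity at or above a cutoff, and that a cutoff of 0.9 is reasonable; the function names, the cutoff, and the toy sequences are all hypothetical and are not taken from MC18669.

```python
def identity(a, b):
    """Fraction of identical residues over columns where neither row is gapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def cull(aligned, threshold=0.9):
    """Greedily keep a sequence only if it is below the identity threshold
    with respect to every sequence kept so far."""
    kept = {}
    for name, seq in aligned.items():
        if all(identity(seq, other) < threshold for other in kept.values()):
            kept[name] = seq
    return kept

# Toy aligned fragments for illustration only.
aligned = {
    "seq1": "MLISHSDLNQ",
    "seq2": "MLISHSDLNQ",  # exact duplicate of seq1, so it is dropped
    "seq3": "MSSYSDFSQQ",
    "seq4": "MLIAHSDLNE",  # 8 of 10 identical to seq1, so it is kept
}
nonredundant = cull(aligned)
print(list(nonredundant))
```

Because the procedure is greedy, the result depends on the order in which the sequences are visited. A tree-based choice of representatives would avoid that dependence.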


The distribution of the sequences across the tree was reasonably balanced. In the
subtrees, pairs separated by 8, 41, and 119 PAM units provide a range of divergence, and
two of these subfamilies were themselves separated by about 70 PAM units. Therefore, a
prediction was attempted.

The  eight  non-redundant  sequences,  with  the  target  listed  as  sequence  “b*”,  were 
submitted  to  the  Darwin  server,  which  generated  a multiple  sequence  alignment  (Figure 
5-3)  and  a phylogenetic  tree  (Figure  5-4).  The  subfamilies  apparent  on  the  tree  are 
identified  by  horizontal  lines  within  the  MSA.  Because  the  multiple  sequence  alignment 
represents  a small  number  of  sequences  and  not  a substantial  amount  of  divergent 
evolution,  it  was  not  productive  to  examine  each  subfamily.  No  distant  families  were 
found  by  the  “bridges”  utility  in  the  MasterCatalog,  which  relies  on  a sequence 
comparison  to  detect  distant  homology. 

In the fully computational part of the prediction process, the Darwin server assigned
sites to the surface (S, s, reflecting strong and weak assignments), the interior (I, i,
strong and weak assignments), and the "active site" (A, a, strong and weak assignments).

These  were  used  as  the  starting  point  for  a manual  inspection.  Parses  were  assigned 
to  positions  010-014  (parsing  strings  PG  and  PS),  025-029  (parsing  string  PxSP),  040- 
049  (parsing  string  GGGGNGPD),  058-061  (parsing  string  SNDN),  064-074  (parsing 
string  PPKGS),  088-091  (parsing  strings  DPD,  DG,  and  gapping),  099-105  (parsing 
string  PDGDgapD),  152-158  (based  on  a gap  that  is  well  anchored  between  sequences  f 
and  d),  158-167  (based  on  the  parsing  string  DDN  and  a large  number  of  consecutive 
surface areas), 189-200 (parsing strings GxP, PxP, and GxxP).


10  20  30  40  50  60 

I I I -—--I I •••!■• 

10  20  30  40  50  60  70 

I I I I I I I 

d - SAY SAFS SLLAEAAMPVS PAELHGHLLGRVCAG_AGF DEAAWOHAA  AELLGGAPGERLKAA 

f - QMGLSVTAPELHGSLSGLLAGG_GGNGPDWLAMILAD AEVAAPPKG  SV 

e - MSSYSDFSQQLKTAGIALSAAELHGFLTGLICGG_I_HDQSWQPLLFQFTNENHAYP TALLQE 

b*- MLISHSDLNQQLKSAGIGFNATELHGFLSGLLCGG_L_KDQSWLPLLYQFSNDNHAYP  TGLVOP 

g - MSIENTLPTYPSLALALSQQAVALTPAEMHGLISGMLCGG_S_KDNGWQTLVHDLTNEGVAFP QALSLP 

h - MLMSIQNEMPGYDEMNRFLNQQGAGLTPAEMHGLISGMICGG_N_NDSSWQPLLHDLTNEGLAFG HELAQA 

a - MLMSIONEMPGYNEMNOYLNOOGTGLTPAEMHGLISGMICGG_N_DDSSWLPLLHDLTNEGMAFG  HELAQA 

C - ML  SGGL  SLNDKS WQALVFDYTNDGMGWP I G ALASA 


iiiaisasi . s . SsiSSiissi . S . isiiaia. iis . ili . . . s . SsssiSsliisissssiii . . s . . sis . s 
ppppp  PPPPP  PPPPPPPPPP  PPPP  PPPPPPPPPP 

hhhhhhhhhhhh  eeeeeee  eeeee  eee 

70  80  90 100  110  120  130 

I I I — I I I I--- 

80  90  100  110  120  130  140 

I I I I I I I 

d - LSGLLGMVRQDFSAGE_VAWMLLPD_D_ETPLAQRTEALGQWCQGFLAGFGLTAREGS_LTGEAEEVLQDMAA 
f - LERLYQATASQLEDPD_FAFQLLLAD_D_GATLAARADALFEWCRAFLGGFGLAAHSRSVLSAEGDEILRDLAK 
e - VTQIQQHISKKLADIDGFDFELWLPENE_D_DVFTRADALSEWTNHFLLGLGLAQPKLDKEKGDIGEAIDDLHD 
b*-  VTELYEQISQTLSDVEGFTFELGLTEDE  N VFTQADSLSDWANQFLLGIGLAQPELAKEKGEIGEAVDDLQD 
g - LQQLHEATQEALEN_EGFMFQLLI PEGE_DVTVFDRADALSGWVNHFLLGLGMLQPKLAQVKDEVGEAIDDLRN 
h - LRKMHAATSDALED_DGFLFQLYLPEGD_DVSVFDRADALAGWVNHFLLGLGVTQPKLDKVTGETGEAIDDLRN 
a - LRKMHSATSDALQD_DGFLFQLYLPDGD_DVSVFDRADALAGWVNHFLLGLGVTQPKLDKVTGETGEAIDDLRN 
C - _EQILLAMSAQLVDTD_FELSLLLPEGEGEEALFELADAVAEWINHFISGLGLSGANLKHASVEAKEALEDLEE 

isSIiSiisSsISS.S.IsIsIilsSss.S.sIIssisiliSIis.IIi.I.I. isSiSsssssisAIIsAIsS 
PPPP  PPPPPPPP  pp  PPPP  PPP  PPPP 

hhhhhhhhhhhhh  eeeeeee  hhhhhhhhhhhhhhhh  coil  hhhhhhhhhhhhh 

140  150  160  170  _ 180 

--I I I • • • I -•••!•• 

150  160  170  180  190  200 

•I I I I I I 

d - IAQVQ GQLEDSEDGETDYMEVMEYLRVAPLLLFAECGKPLEPAP KPSLH 

f - LAQASVDDFDMNEEEEDGS  LEEIEEFVRVAVLLLHGDC 

e - ICQL GYDESDDKEELS EALEE 1 1 EYVRTLACLL FTHFQPQLPE_QKPVLH 

b*-  ICQL  GYDEDDNEEELAEALEEIIEYVRT IAMLF  Y SHFNEGE I E_SKPVLH 

g - IAQL GYDEDEDQEELAQSLEEWEYVRVAAILCHIEFTQQKPTAPEMHKPTLH 

h - IAQL GYDESEDQEELEMSLEEIIEYVRVAALLCHDTFTRQQPTAPEVRKPTLH 

a - IAQL  GYDEDEDQEELEMSLEEIIEYVRVAALLCHDTFTHPQPTAPEVQKPTLH 

C - MSKL GIDEEDDLAEQAELLEQVIEHIKACVLVLHAEFGVKPEQD TKPTVH 


Iiils . asissSsSsssiSssissliAilsililllsssIsSSSsss . aiSa . iia 
PPPP  PPPPPPPPPP  PPPPPPPPPPPPP 

hhh  hhhhhhhhhhhhhh  eeeeee 

hhhhhhhhhhhhhhh  eee 

Figure 5-3. The alignment for Target T0129. There are two sets of numbering; the upper
set  corresponds  to  the  target  sequence  and  the  lower  set  refers  to  the  overall 
alignment  and  the  numbering  in  this  discussion.  All  alignments  were 
generated  by  the  DARWIN  package.  The  sequence  for  which  the  crystal 
structure  is  unknown  is  b,  and  is  indicated  by  a *. 


Figure 5-4. Phylogenetic tree for Target T0129, MC family 18669. The tree is generated
by  the  DARWIN  package  as  the  most  probable  phylogenetic  tree  by  a least- 
squares  fitting  of  the  PAM  distance  data. 

Assigning  secondary  structure  to  the  long,  unparsed  segment  between  positions  108 
and  152  was  clearly  problematic.  We  considered  assigning  the  123-130  segment  as  a 
parse,  based  on  the  parsing  string  SGLGLSG,  although  the  conservation  of  hydrophobic 
residues  at  positions  122,  125,  and  127  makes  it  likely  that  this  is  not  a surface  coil.  Other 
potential  parses  could  be  seen  at  positions  115-116  (dipeptide  SG),  128-130  (distributed 
SG  and  P),  and  136-140  (distributed  G).  We  assigned  structure  ignoring  the  parse  issue, 
hoping  to  later  return  to  assign  parses. 

A 3.6  residue  pattern  of  interior  and  surface  assignments  between  parses  was  then 
sought  as  an  indicator  of  standard  alpha  helices.  Several  regions  stood  out.  A potential 
helix  (Figure  5-5a)  is  anchored  on  segment  013-019.  Amphiphilicity  could  be  seen  for  the 
012-023 segment; the pattern is broken by position 024. This helix includes two positions
(012  and  013)  that  belong  to  the  parsed  segment.  No  alternating  amphiphilicity  was  seen, 
meaning  that  the  segment  could  not  plausibly  be  assigned  as  a strand.  The  helix 
assignment  was  therefore  accepted,  with  lower  probabilities  assigned  to  the  first  two 
positions. 
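
The search for a 3.6-residue pattern can be made quantitative. The sketch below, which illustrates the idea and is not the procedure used for these predictions, places each site on a wheel at a fixed angular step and sums interior and surface assignments as vectors. A step of 100 degrees corresponds to 3.6 residues per turn; the step of 180 degrees used for an alternating strand pattern, the numerical weights given to strong and weak assignments, and the example strings are assumptions.

```python
import math

# Assumed weights: interior positive, surface negative, weak assignments halved.
WEIGHTS = {"I": 1.0, "i": 0.5, "s": -0.5, "S": -1.0}

def periodicity(assignments, degrees_per_residue):
    """Per-residue magnitude of the summed burial vectors on a wheel
    with the given angular step between consecutive sites."""
    x = y = 0.0
    for k, code in enumerate(assignments):
        weight = WEIGHTS.get(code, 0.0)  # other symbols contribute nothing
        angle = math.radians(k * degrees_per_residue)
        x += weight * math.cos(angle)
        y += weight * math.sin(angle)
    return math.hypot(x, y) / len(assignments)

# Hypothetical assignment strings, not taken from the T0129 alignment.
helix_like = "ISSIISSIISII"   # interior sites recur roughly every 3.6 positions
strand_like = "ISISISIS"      # strict interior/surface alternation

for name, segment in (("helix-like", helix_like), ("strand-like", strand_like)):
    as_helix = periodicity(segment, 100.0)
    as_strand = periodicity(segment, 180.0)
    print(f"{name}: helix signal {as_helix:.2f}, strand signal {as_strand:.2f}")
```

A segment is a helix candidate when its 100-degree signal clearly exceeds its 180-degree signal, and a strand candidate in the opposite case. Segments in which neither signal stands out still need manual inspection.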

A second  potential  helix  (Figure  5-5b)  is  anchored  on  segment  075-082. 
Amphiphilicity  is  good  for  the  075-088  segment,  with  position  081  being  incorrectly 
placed  as  an  interior  residue  (albeit  not  a strong  one)  on  the  surface  arc.  A third  potential 
helix  (Figure  5-5c)  is  anchored  on  segment  110-116.  Amphiphilicity  is  perfect  for  the 
111-122 segment, and is broken by positions 110 and 123. As the first turn of a helix
frequently  brings  hydrophilic  residues  into  the  interior  arc,  it  is  possible  that  positions 
109-110  are  also  part  of  the  helix. 


Figure 5-5. Schiffer-Edmundson helical wheels show 3.6-residue periodicity in surface
(s) and interior (i) assignments for five segments of Target T0129.


Another  potential  helix  (Figure  5-5d)  is  anchored  on  segment  138-143. 
Amphiphilicity  is  good  for  the  136-152  segment,  if  the  conserved  E and  D are  assumed  to 
be on the surface. Strengthening the case for a helix is that these positions are brought
together. This undoubtedly has functional implications, involving the interaction of the
protein with another protein or small molecule.

Every  prediction  has  one  problematic  segment,  and  this  prediction  is  no  exception. 
The  region  including  positions  172-178  is  evidently  near  an  active  site,  suggested  by  the 
conserved  E at  position  175.  Amphiphilicity  is  seen  on  a helical  wheel  (Figure  5-5e) 
from  170-178,  assuming  that  position  175  lies  on  the  surface  and  the  gap  in  sequence  f is 
moved.  Indeed,  if  we  could  ignore  the  gap  in  sequence  f,  the  amphiphilicity  would 
extend  back  to  position  165.  It  is  not  uncommon  for  helices  to  delete  a turn,  so  this  might 
not  be  unreasonable.  In  this  case,  the  parse  is  at  the  R,  and  the  following  segment  is  an 
internal  strand.  Position  174,  which  largely  contains  hydrophobic  amino  acids  other  than 
an  E in  sequence  f,  is  at  the  boundary  between  the  surface  and  interior  arcs  of  the  helix. 

The  amphiphilicity  is  broken  by  a clash  at  the  beginning  and  end  of  the  helix, 
positions  172  and  179.  Following  179  is  a string  of  interior  residues  with  only  a single  P 
serving as a weak parse. Therefore, two alternative secondary structural assignments
were proposed. The first is a helix from position 170 to 178, followed by a strand from
181 to 184. Alternatively, the segment could be one long helix that extends inside the
folded structure.

Strands  were  canonically  assigned  to  short  segments  between  parses.  A long  strand 
was  proposed  for  the  segment  030-040,  recognizing  that  this  contains  some  highly 
conserved positions suggestive of an active site. Since there are few sequences in the
alignment,  and  we  do  not  canonically  assign  any  secondary  structure  to  regions  near  an 
active  site,  this  segment  was  assigned  a coil-strand  conformation.  The  092-098  segment 
in  addition  showed  alternating  periodicity,  and  was  strongly  assigned  as  a strand. 

Results  for  T0129 

Target  T0129  contains  7 alpha  helices.  The  first  four  and  the  last  three  form 
separate bundles (N- and C-terminal bundles) separated by a long, essentially extended
crossover  loop  between  helices  4 and  5 (Fig.  5-6).  An  additional  noteworthy  feature  is 
another  extended  loop  segment  connecting  helices  6 and  7.  Helices  2 and  5 are  internal, 
judging  by  the  surface  and  interior  assignments  made  by  DSSP.  The  contact  site  appears 
to  be  helices  2 and  5. 


Figure 5-6. Ribbon representation of T0129. The figure on the right shows the elements
predicted correctly (green) and incorrectly (red).

The multiple sequence alignment in Figure 5-7 displays a number of features
associated with the experimental crystal structure (1izm). Immediately beneath the
alignment are the surface and interior assignments as predicted by DARWIN, labeled
PDAR. The following row of letters, labeled DSSP, gives the experimentally determined
surface and interior assignments. Following the surface and interior assignments is the
target  sequence,  TARG.  Beneath  the  target  sequence  is  a cartoon  representation  of  the 
secondary  structure  of  this  protein.  These  elements,  helices  and  strands,  are  colored  to 
illustrate  the  quality  of  the  predictions;  green  are  correctly  predicted  elements,  red 
represents  mispredicted  elements,  and  orange  represents  missed  elements.  The  final  rows, 
labeled  EXPT,  PRED  and  Baker,  represent  the  experimental  secondary  structural 
elements,  the  predicted  structural  elements,  and  those  predicted  by  Baker,  respectively. 

To evaluate our prediction of target T0129, we categorized the errors in the secondary
structure prediction into the various classes, following a flow chart. First, scan the
experimental secondary structure assignment sequentially. Next, record the helix and
strand assignments. In T0129, there are 7 alpha helices and 0 strands. Finally, bin a
secondary structure prediction as correct if the identified segment in the target
overlaps with a predicted element of the same type sufficiently that we could assign the
prediction as being "correct" for this element. In T0129, there were 5 correct
predictions (helices A, E, G, I, and J).

For the remaining elements, we bin them in three groups. First, the element was
assigned in the experimental structure, but not in the prediction. Second, the element
was assigned in the prediction, but not in the experimental structure. Finally, the
element was assigned in both, but as a different type in the experimental structure than
in the prediction. There were 5 incorrect predictions (elements B, C, D, F, and H).
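
The binning procedure can be expressed as a short Python sketch. It is a hypothetical illustration of the flow chart rather than the procedure applied by hand here. In particular, the text does not give a numerical criterion for "sufficient" overlap, so the cutoff of half the experimental element is an assumption, as are the bin names and the example strings.

```python
from itertools import groupby

def elements(states):
    """Return (kind, start, end) for each helix (H) or strand (E) run."""
    found, pos = [], 0
    for kind, run in groupby(states):
        length = len(list(run))
        if kind in ("H", "E"):
            found.append((kind, pos, pos + length))
        pos += length
    return found

def overlap(a, b):
    """Number of sites shared by two (kind, start, end) elements."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))

def bin_elements(observed, predicted, min_fraction=0.5):
    """Sort elements into correct, missed, overpredicted, and different."""
    obs, pred = elements(observed), elements(predicted)
    bins = {"correct": [], "missed": [], "overpredicted": [], "different": []}
    for o in obs:
        same = sum(overlap(o, p) for p in pred if p[0] == o[0])
        other = sum(overlap(o, p) for p in pred if p[0] != o[0])
        if same >= min_fraction * (o[2] - o[1]):
            bins["correct"].append(o)
        elif other > 0:
            bins["different"].append(o)
        else:
            bins["missed"].append(o)
    for p in pred:
        if not any(overlap(p, o) for o in obs):
            bins["overpredicted"].append(p)
    return bins

# Hypothetical strings for illustration, not the T0129 assignments.
observed = "CC" + "H" * 6 + "C" * 6 + "H" * 6 + "C" * 4 + "H" * 4 + "CC"
predicted = "CC" + "H" * 5 + "CC" + "E" * 4 + "C" + "E" * 5 + "C" * 11
bins = bin_elements(observed, predicted)
print({name: len(items) for name, items in bins.items()})
```

In this toy example each bin receives exactly one element: the first helix is correct, the second is predicted as a strand, the third is missed, and one predicted strand has no experimental counterpart.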

We first evaluated the quality of the automatic secondary structure prediction.
Sixty-three percent of the residues assigned strongly to the surface were in fact on the
surface, and 100 percent of the residues assigned strongly to the inside were in fact on
the inside. Of the residues that were weakly assigned, 53 percent of those assigned to the
surface were in fact on the surface, and 90 percent of those assigned to the interior were
in fact on the inside. This appears to be a relatively poor performance, as expected given
the small number of sequences in the family. In many cases, the assignment of the surface
was based on variation in the most distant sequences. These most distant sequences were,
however, questionably aligned in parts of the multiple sequence alignment.
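Percentages of this kind can be tallied directly from the two assignment rows. The following is a minimal sketch, assuming that the predicted row uses S/s for strong/weak surface and I/i for strong/weak interior, and that the observed row is reduced to surface or interior regardless of case; the function name and this reduction are illustrative assumptions, not the evaluation code used here.

```python
def assignment_accuracy(predicted, observed):
    """Return {class: (n_correct, n_total)} for the classes S, s, I, i.

    Positions whose predicted symbol is not one of S, s, I, i (for example
    '.', 'A', 'a') or whose observed symbol is not a surface/interior letter
    are skipped.
    """
    expected = {"S": "s", "s": "s", "I": "i", "i": "i"}
    counts = {k: [0, 0] for k in expected}
    for p, o in zip(predicted, observed):
        if p not in expected or o.lower() not in ("s", "i"):
            continue
        counts[p][1] += 1
        if o.lower() == expected[p]:
            counts[p][0] += 1
    return {k: tuple(v) for k, v in counts.items()}
```

Dividing the first number of each pair by the second gives the fraction of strong or weak assignments that agree with the experimental row.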


10  20  30  40  50  60 

I I I I I — ;-l 

10  20  30  40  50  60  70 


| | | I I I l 

(j  _ SAY S AF  S SLLAEAAMPV S PAELHGHLLGRVC AG_AGF DEAAWQH_AAAELLGGAPGERLKAA 

f _ QMGLSVTAPELHGSLSGLLAGG_GGNGPDWLAMIL_ADAEVAAPPKG SV 

e - MSSYSDFSQQLKTAGIALSAAELHGFLTGLICGG_I_HDQSWQPLLFQFTNENHAYP TALLQE 

b* - SHSDLNQQLKSAGIGFNATELHGFLSGLLCGG_L_KDQSWLPLLYQFSNDNHAYP TGLVQP 


g - MSIENTLPTYPSLALALSQQAVALTPAEMHGLISGMLCGG_S_KDNGWQTLVHDLTNEGVAFP QALSLP 

h - MLMSIQNEMPGYDEMNRFLNQQGAGLTPAEMHGLISGMICGG_N_NDSSWQPLLHDLTNEGLAFG HELAQA 

a - MLMSIQNEMPGYNEMNQYLNQQGTGLTPAEMHGLISGMICGG_N_DDSSWLPLLHDLTNEGMAFG HELAQA 

c _ MLSGGLSLNDKSWQALVFDYTNDGMGWPIG_ALASA 

★ ..  ......  ........  . • ..... 

PDARiiiaisasi . s . SsiSSiissi . S . isiiaia. iis . ili . . . s . SsssiSsliisissssiii . .s. .sis.s 
DSSP  sislsi  isilSSisIsIilllllllllllllli  I sssilisillsillSsssli  Ssliil 


TARG  LISHSDLNQQLKSAGIGFNATELHGFLSGLLCGG L KDQSWLPLLYQFSNDNHAYP TGLVQP

[Rows EXPT, PRED, and Baker, with the secondary structure cartoon: helix (H) and strand (E)
assignments; column positions not recoverable from the extracted text.]

70  80  90 100  110  120  130 

| | . |_-  — • ....... | ......... | ......... | .... , . ... | . • • 

80  90  100  110  120  130  140 

, ....  | ..... | .........  | .........  | .........  | .........  | .........  | .......  . 

d - LSGLLGMVRQDFSAGE_VAWMLLPD_D_ETPLAQRTEALGQWCQGFLAGFGLTAREGS_LTGEAEEVLQDMAA 
f - LERLYQATASQLEDPD_FAFQLLLAD_D_GATLAARADALFEWCRAFLGGFGLAAHSRSVLSAEGDEILRDLAK 
e - VTQIQQHISKKLADIDGFDFELWLPENE_D_DVFTRADALSEWTNHFLLGLGLAQPKLDKEKGDIGEAIDDLHD 
b*-  VTELYEQISQTLSDVEGFTFELGLTEDE_N__VFTQADSLSDWANQFLLGIGLAQPELAKEKGEIGEAVDDLQD 
g - LQQLHEATQEALEN_EGFMFQLLIPEGE_DVTVFDRADALSGWVNHFLLGLGMLQPKLAQVKDEVGEAIDDLRN 
h - LRKMHAATSDALED_DGFLFQLYLPEGD_DVSVFDRADALAGWVNHFLLGLGVTQPKLDKVTGETGEAIDDLRN 
a - LRKMHSATSDALQD_DGFLFQLYLPDGD_DVSVFDRADALAGWVNHFLLGLGVTQPKLDKVTGETGEAIDDLRN 


Figure  5-7.  Alignment  for  Target  T0129.  All  alignments  were  generated  by  the  DARWIN 
package.  The  sequence  for  which  the  crystal  structure  was  unknown  is  b,  and 
is  indicated  by  a *. 




C - _EQILLAMSAQLVDTD_FELSLLLPEGEGEEALFELADAVAEWINHFISGLGLSGANLKHASVEAKEALEDLEE 

* * * * ## 

isSIiSiisSsISS . S. IsIslilsSss . S. sllssisiliSIis . Hi . I . I . isSiSsssssisAIIsAIsS 
IisIIsilisIIsiiSsiilsIilissS  I IliilsIIIilliillilliilissISsiSSillsIIsilss 


TARG  VTELYEQISQTLSDVEGFTFELGLTEDE N VFTQADSLSDWANQFLLGIGLAQPELAKEKGEIGEAVDDLQD

[Rows EXPT, PRED, and Baker, with the secondary structure cartoon: helix (H) and strand (E)
assignments; column positions not recoverable from the extracted text.]


140  150  160  170  _ 180 

. I I I — --I -•••!•• 

150  160  170  180  190  200 

• | I I I I I 

d - IAQVQ GQLEDSEDGETDYMEVMEYLRVAPLLLFAECGKPLEPAP KPSLH 

f - LAQASVDDFDMNEEEEDGSLE EIEEFVRVAVLLLHGDC 

e - ICQL GYDESDDKEELSEALEEIIEYVRTLACLL FTHFQPQLPE_QKPVLH 

b*-  ICQL GYDEDDNEEELAEALEE I I EYVRT I AMLF Y SHFNEGEI E_SKPVLH 

g - IAQL GYDEDEDQEELAQSLEEWEYVRVAAILCHIEFTQQKPTAPEMHKPTLH 

h - IAQL GYDESEDQEELEMSLEEIIEYVRVAALLCHDTFTRQQPTAPEVRKPTLH 

a - IAQL GYDEDEDQEELEMSLEEIIEYVRVAALLCHDTFTHPQPTAPEVQKPTLH 

C - MSKL GIDEEDDLAEQAELLEQVIEHIKACVLVLHAEFGVKPEQD TKPTVH 

* 

Iiils . asissSsSsssiSssissliAilsililllsssIsSSSsss .aiSa.  iia 


DSSP  IlSi SiSsSisisslisilisiliillill il Issli
TARG  ICQL GYDEDDNEEELAEALEEIIEYVRTIAMLF YSHFN

[Rows EXPT, PRED, and Baker, with the secondary structure cartoon: helix (H) assignments;
column positions not recoverable from the extracted text.]

Figure  5-7.  (Continued) 

Let us consider the quality of the multiple sequence alignment itself. It is clear that
the automatic multiple sequence alignment tool placed a gap in proteins d and f where the
target structure had a standard secondary structural element (C). This creates a parse too
soon. Indeed, the gapping in this region is problematic. In addition, the helix has a Pro at
position 3, which was interpreted as a parse. In our prediction, however, we assigned a
five-site strand to this region. This prediction was incorrect because of a gapping problem.
In a complementary multiple sequence alignment, the gaps were shifted by hand to generate
this mistake. This example illustrates the need for improved gap placement tools. In
principle, one does not know whether this represents a false placement of the gap, or
whether a segment of the standard secondary structural element was in fact removed.
Brian Matthews at the University of Oregon, for example, argues that it is acceptable to
remove by deletion a single turn of the helix. Nonetheless, it is clear that the incorrect
assignment of a strand instead of helix C arose in large part because of a misplaced gap.

A third class of misprediction arises from the naivety of the simple heuristic.
Internal helices are not expected to reveal themselves through patterns in their surface and
interior assignments. It is quite clear that internal helix B was missed because of this
weakness in the heuristic. A helix was identified in the experimental structure,
but a strand was predicted. In this case, the experimental secondary structure assignment
identifies a helix on segment 027-040; the first two positions of the helix are aligned to a
distributed parse (Pro) in the multiple sequence alignment, and the helix survives a
conserved Gly at position 033. We then examined the details of that structure. Two
functional amino acids, a glutamate and a histidine, are conserved at positions 023 and 025.
This may indicate an active site, and these residues were so assigned.

This then raises the question: how were we able to correctly identify internal helix
5? The answer arises from the incorrect assignment of interior and surface residues
that are, according to the DSSP file, buried even though they are on the surface. In each
case, the burying is not complete, it seems, as the DSSP assignment contains a number of
weak interior positions. Further, some of the sites that are assigned to be inside clearly
cannot be inside, at least judging from the fact that they show variation among
charged amino acids even within some of the subfamilies. The conclusion, therefore, is
that the automatic server prediction, while not necessarily a good indicator of side chain
exposure per se, is nevertheless an acceptable indicator of side chain accessibility when
it comes to using this simple heuristic.

In short, the principal difficulties in this prediction arose from two well understood
problems with this approach. First, the approach can be defeated by misplaced gaps. This is
especially true when the number of sequences is small. Second, this approach does not
easily identify internal helices, especially when they are short.

A comment needs to be made about the misprediction of a part of helix 7 as a
strand. This helix was correctly predicted even though protein f has a gap that is
misplaced. The manual predictors did move the dipeptide, LE, in some of the multiple
sequence alignments that they adjusted by hand. The predictors, however, were
clearly aware of the fact that they were dealing with a large, unparsed protein segment.
The manual predictors were very uncomfortable with this and considered, as a second
choice, breaking this very long helix into two pieces.

In conclusion, target T0129 should be taken as an example of a prediction that had
too few sequences to provide the information needed to predict this structure correctly.
Therefore, it is not surprising that some of the CASP judges found that no one obtained the
correct structure, while just one group got the overall fold correct. Nevertheless, it
appears that some groups got the secondary structure correct, including the Jones-NewFold
and Doniach groups.

We also made another serious mistake by placing a beta strand (F) in what is an
extended linker region. This is, perhaps, not a mistake. In F, we found a good
amphiphilic pattern; however, the segment is not a strand according to the experimental
structure. In this case, the experimental surface and interior assignments agreed with the
theoretical prediction in 6 out of 7 positions. Only the second position, which was
predicted as a weak surface assignment (s), was actually a weak interior assignment (i).
A straightforward way to find a beta strand is to look at the φ and ψ angles. The second way
is to look at the relative orientation of the side chains. This extended strand clearly does
not have the hydrogen bonding that it would need to have to be part of a beta sheet. To
answer this question, we must consider whether or not the φ and ψ angles and the relative
orientation of the side chains agree with a beta orientation. Table 5-4 indicates that
segment F is not in fact a strand.
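The dihedral angle check can be illustrated with the values listed in Table 5-4. The sketch below is hedged: the rectangular window used for the beta region of the Ramachandran plot (φ between -180° and -45°, ψ above 90° or below -150°) is an assumed, commonly used approximation, not a definition taken from this work, and the function names are illustrative.

```python
# Dihedral angles (phi, psi) in degrees for segment F, as listed in Table 5-4.
SEGMENT_F = [
    ("F082", 60.95, 51.02),
    ("T083", -107.34, -7.93),
    ("F084", -65.85, 129.32),
    ("E085", -117.40, 149.42),
    ("L086", -75.91, -30.46),
    ("G087", 76.45, 42.23),
]


def in_beta_region(phi, psi):
    """Assumed rectangular approximation of the beta region."""
    return -180.0 <= phi <= -45.0 and (psi >= 90.0 or psi <= -150.0)


def beta_residues(segment):
    """Names of the residues whose (phi, psi) fall in the assumed window."""
    return [name for name, phi, psi in segment if in_beta_region(phi, psi)]
```

Under this assumed window only F084 and E085 fall in the beta region, so the six residues do not form a continuous run of beta conformations, which is consistent with the statement that segment F is not a strand.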

In this prediction, one can see thoughts about possible tertiary structure consciously
influencing the assignment of secondary structure. Once one has one beta strand that one is
relatively confident of, the manual predictors seek a second, third, and maybe even a
fourth beta strand, simply to have a sufficient number of beta strands from which to build
a beta sheet. The notes from the manual predictor indicate strong confidence in the
extended structure, while recognizing that the small number of sequences in the alignment
was likely to make the overall prediction inaccurate.

It is interesting to compare the results of an automatic server, PHD, for this
segment. PHD provides a neural network implementation of the Benner-Gerloff method
(Rost and Sander 1994). The results of the PHD server are shown in Table 5-5.

Table 5-4. Dihedral angles for segment F in T0129

Amino acid (position)     Phi [°]     Psi [°]
PHE  F  (082)               60.95       51.02
THR  T  (083)             -107.34       -7.93
PHE  F  (084)              -65.85      129.32
GLU  E  (085)             -117.40      149.42
LEU  L  (086)              -75.91      -30.46
GLY  G  (087)               76.45       42.23



Table  5-5.  PHD  results  for  T0129 


AA   MLISHSDLNQQLKSAGIGFNATELHGFLSGLLCGGLKDQSWLPLLYQFSNDNHAYPTGLVQPVTE
PHD  HHHHHHHHHHH  HHHHHHHHHHEE  HHHHHHHHHHH  HHHHHHHHHH

AA   LYEQISQTLSDVEGFTFELGLTEDENVFTQADSLSDWANQFLLGIGLAQPELAKE
PHD  HHHHHHHHH  EEEEE  HHHHHHHHHHHHHHHHHHHH

AA   KGEIGEAVDDLQDICQLGYDEDDNEEELAEALEEIIEYVRTIAMLFYSHFNEGEIESKPVLH
PHD  HHHHHHHHHHHHHH  HHHHHHHHHHHHHHHHHHHHHHHHH


The PHD server has the same problem as we have. The automatic server also
predicts a beta strand in the extended region where we predict it. The server evidently
does not mistake internal helix two (helix B) or helix three (helix C). In this particular
case, it looks as if the PHD server did better.

The prediction of target T0129 indicates that our method over-predicts helices.
The other significant errors are related to parsing and conservation. In element B, the
error is associated with parsing and an active site. In element C, the error is related to the
poor conservation within the multiple sequence alignment. The moral from this
prediction is that when you enter a contest, do not make predictions for families with few
structures.


T0148 

Target T0148 was submitted by an unnamed crystallographer. The protein,
designated HI1034, is from Haemophilus influenzae and is 163 amino acids in length.
The MasterCatalog (GenBank Version 129) yielded a total of 22 homologous proteins in
a preconstructed family designated MC7610, as shown in Figure 5-8.

The PAM width of the family was just under 100. No bridges were found to other
proteins that might expand the tree. It later turned out that the protein had a distant
homolog whose crystal structure was known. That homolog was the ribosome recycling
factor α+β domain. The family members were dispersed into 2 subfamilies with PAM
widths of ca. 75 and 2 outgroups. This small number of sequences implies that the
predicted structure would have low overall reliability. The balance in the tree, however,
implied that the few sequences are strategically placed, making a prediction effort
worthwhile.

The 22 sequences from the MasterCatalog family (MC7610) were reduced to 15,
based on redundancy, and submitted to the Darwin server. These were clustered into
two subfamilies, with pairs that had distances of 10 PAM units, groups of these separated
by approximately 50 PAM units, and an overall distance close to 100 PAM units. The
corresponding multiple sequence alignment is shown in Figure 5-9. The target is listed as
sequence “g”. The Darwin server also prepared the strong and weak surface assignments
(S,s), the strong and weak interior assignments (I,i), and the strong and weak “active site”
assignments (A,a).


Figure 5-8. Phylogenetic trees for Target T0148, MC family 7610.




Parses were readily assigned to positions 008-009 (based on the parsing string PS),
058-061 (based on the distributed parsing strings SD, SDD, and GXDS, inter alia),
094-095 (based on the parsing string SG), and 160-162 (based on the parsing string GXP).
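A simple way to locate candidate parsing strings is a pattern scan. The sketch below is a simplification under stated assumptions: it scans a single sequence, whereas the parses above are read from columns of the multiple sequence alignment (distributed parsing strings in particular involve several sequences). The function name is hypothetical, and X is treated as a wildcard for any one residue.

```python
import re


def find_parse_motifs(sequence, motifs):
    """Return (motif, start, end) for every occurrence, using 1-based positions.

    'X' in a motif matches any single character. A lookahead is used so that
    overlapping occurrences are all reported.
    """
    hits = []
    for motif in motifs:
        pattern = re.compile("(?=(" + motif.replace("X", ".") + "))")
        for match in pattern.finditer(sequence):
            start = match.start() + 1
            hits.append((motif, start, start + len(match.group(1)) - 1))
    return sorted(hits, key=lambda hit: hit[1])
```

For example, scanning the fragment MPSFDIVSE for PS, SG, and GXP reports only PS at positions 2-3 of that fragment.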

Helices  were  then  sought  by  looking  for  the  pattern  of  interior  and  surface 
assignments  between  parses.  Several  regions  stood  out  as  potential  helices  such  as  the 
one  anchored  on  segment  010-016.  Amphiphilicity  could  be  seen  for  the  010-027 
segment  and  is  broken  by  position  028,  with  position  017  being  incorrectly  placed  as  a 
surface  residue  (albeit  not  a strong  one)  on  the  interior  arc.  This  assignment  also  assumes 
that  the  conserved  E at  position  020  lies  on  the  interior  arc. 
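The search for amphiphilic patterns on a helical wheel can be approximated numerically. The following sketch is not the procedure used in this work: it assumes a rotation of 100° per residue for a helix and 180° per residue for a strand, and it assigns illustrative weights of ±1 and ±0.5 to strong and weak interior/surface assignments. The score is the length of the vector sum of the weighted positions, divided by the number of scored positions.

```python
import math


def periodicity_score(assignments, angle_deg):
    """Amphiphilicity of a surface/interior string at a given rotation per residue.

    Weights are assumptions: I = +1, i = +0.5, S = -1, s = -0.5.
    Other symbols (for example '.', 'A', 'a') are skipped.
    """
    weights = {"I": 1.0, "i": 0.5, "S": -1.0, "s": -0.5}
    x = y = 0.0
    scored = 0
    for k, symbol in enumerate(assignments):
        weight = weights.get(symbol)
        if weight is None:
            continue
        theta = math.radians(k * angle_deg)
        x += weight * math.cos(theta)
        y += weight * math.sin(theta)
        scored += 1
    return math.hypot(x, y) / scored if scored else 0.0
```

A strictly alternating pattern such as ISISISIS scores higher at 180° than at 100°, which corresponds to the alternating periodicity expected of a strand, whereas a pattern repeating every three to four positions favors the 100° rotation of a helix.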


10  20  30  40  50  60 

I I I I I I 

10  20  30  40  50  60  70 


C - SFDISAALDKQELKNAFEQAKKELDSRYDLKGIKCEIDLSEKESIFKLSSSSEGKLDVLKDIVISK 

f _ SFDLVSEVNLQEVDNAINLAMKEITNRYDFKGSKSSIERTGDE_QVTLISDDEYKLESVIDILKSK 

a _ SFDIVSKVELPEVONAIO IALKEISTRYDFKGSKSDI SLDKEE  LVLVSDDEFKLSOLKDVLVSK 

d - SFDIVSKVERQEVDNALNQAAKEISQRYDFKGVGASISWSGEK ILMEANSEDRVTAVLDVFQSK 

b _ SFDIVSKVDRQEVDNALNQAAKELATRFDFRGTDTKIAWKGDE_AVELTSSTEERVKAAVDVFKEK 

m - SFDIVSDFDRQELVNAVDQVIRDLKSRYDLKDTQTTVEL_GEE_KITIGTDSEFTLESVHNILREK 

i - MPSFDWSELDKHELTNAVDNAIKELDRRFDLKG_KCSFEA KDKSVTLTAEADFMLEQMLDILRSN 

k - DIVSEVDTVELRNAVDNANRELSTRFDFRNVEAGFEL KDEWKLSAEDDFQLGQMMDILRGN 

h - MPSFDIVSEIDAVELRNAVENSTRELASRFDFRNVDASFEL KEETVKLAAEDDFQLGQMMDILRGN 

n - MKGEEKMPSFDIVSEVDLQEARNGVDNAVREVESRFDFRGVEATIELNDANKTIKVLSESDFQVNQLLDILRAK 
e - MKGEEKMPSFDIVSEVDLQEARNAVDNASREVESRFDFRNVEASFELNDASKTIKVLSESDFQVNQLLDILRAK 

1 - MPSFDIVSEIDMQEVRNAVENATRDLANRWDFRNVPASFELNEKNESIKWSESDFQVEQLLDILRAQ 

j - MPSFDIVSEITLHEVRNAVENANRDLTNRWDFRNVQAAIELNEKNESIKVSSESDFQVEQLVDILRNA 

g*- MPSFDIVSEITLHEVRNAVENANRVLSTRYDFRGVEAVIELNEKNETIKITTESDFQLEQLIEILIGS 

O - MPSFDWCEANMVELKNAVEQANKEISTRFDFKGSDARVEH KDQELTLFGDDDFKLGQVKDVLLTK 

* ★ ★ ★ * 

ia . aaai . aiAIi . sIssiAIsAilsiissSIsSAIAIsSiSiSIsisssSssisIsssssi  . IsiliSIISSs 


PPeeee  p ppp  PPPPPPP  PPPP 

hhhhhhhhhhhhhhhhhh  eeeee  eeeeee  eee  hhhhhhhhhh 

Figure  5-9.  The  alignment  for  Target  T0148.  There  are  two  sets  of  numbering;  the  upper 
set  corresponds  to  the  target  sequence  and  the  lower  set  refers  to  the  overall 
alignment  and  the  numbering  in  this  discussion.  All  alignments  were 
generated  by  the  DARWIN  package.  The  sequence  for  which  the  crystal 
structure  is  unknown  is  g,  and  is  indicated  by  a *. 




70  80  90  100  110  12 0_  130  140 

• I I I I I I- i I- 

80  90  100  110  120  130  140  1 

I I I I I I I 

C - LIKRGINPKAIK  ELSRESGAMFRLNLKANDAIDSENAKKIMKVIKDSKLK_VNSSIRGEEIRVAAKQIDDLQ 
f - FIKRGLSQKTMDFGKIERAAGGTVRQWTLLSGIEGERAKKLTKLIRDSKLK_VKAQIQNDQIRVTGKNIDDLQ 
a - LIKRNVPTKNIDYGKVENASGGTVRQRAKLVQGIDKDNAKKINTI IKNSGLK_VKSQVQDDQVRVTGKNKDDLQ 
d - LIKRGISLKALDAGE_PQLSGKEYKIFASIEEGISQENAKKVAKLIRDEGPKGVKAQVQGEELRVSSKSRDDLQ 
b - LIRRDISLKAFEAGE_PQASGKTYKVTGALKQGISSENAKKITKLIRDAGPKNVKTQIQGDEVRVTSKKRDDLQ 
m - AAKRNLSQKIFDFGKVESASGNRVRQEITLKKGISQDIAKQISKLIRDE_FKKVQASIQGDAVRVSAKAKDDLQ 
i - LVKRKVDSQCMEI_KDAYPSGKWKQDVNFREGIDKDLAKKIVGLIKERKLK_VQAAIQGEQVRVTGKKRDDLQ 
k - LAKRGVDAKAME_AKDSVHSGKRWFKDVQFKQGLDPLTSKKWKAIKDAKLK_VQASIQGEKVRVTGKKRDDLQ 
h - LAKRGVDARAMK_AKDSVHIGKNWYKEAEFKQGLEALLAKKIVKLIKDAKIK_VQASIQGDKVRVTGKKRDDLQ 
n - LLKRGIEGASLDVPDEFVHSGKTWYVEAKLKQGIESAVQKKIVKLIKDSKLK_VQAQIQGEEIRVTGKSRDDLQ 
e - LLKRGIEGSSLDVPENIVHSGKTWFVEAKLKQGIESATQKKIVKMIKDSKLK_VQAQIQGDEIRVTGKSRDDLQ 
1 - LSKRGIEGAALEIPEEMARSGKTYSVDAKLKQGIESVQAKKLVKLIKDSKLK_VQAQIQGEQVRVTGKARDDLQ 
j - CIKRGIDSGSLDIPTEYEHSGKTYSKEIKLKQGIASEMAKKITKLIKDSKLK_VQTQIQGEQVRVTGKSRDDLQ 
q*-  CIKRGIEHSSLDIPAESEHHGKLYSKEIKLKQGIETEMAKKITKLVKDSKIK_VQTQIQGEQVRVTGKSRDDLQ 
o - LAKRGVDVRFLDYQDKQKIGGDKMKQWKIKKGVSGELSKKIVKLIKDSKIK_VQGSIQGDAVRVSGAKRDDLQ 
* •*  ★ * * ★★★★★★ 

IisAsISss . I Si . SSsssS .ssIssSisIss. ISssSiAsI . sIIssssiAsli . . IisSsIAI . . sssAAIA 
pp  PP  PPPPeeeeeeeee  PPPP  pPPPPPeeeee  pp  eeeppp 

hhhhhhhh  hhhhhh  hhhhhhhhhh 

150  160  170 


50  160  170 


C - AVMKLVKELDLELNI SFKN- - 
f - QVIQLVKEQDLDFPVQFVNMR 
a - QIISAVRGADLPIDVQFINFR 
d - TVI SLLKGQDFDFALQFVNYR 
b - AVIAMLKKADLDVALQFVNYR 
m - IVIQRLKQEDYPVALQFTNYR 
i - EAIALLRGESLGMPLQFTNFRD 
k - SVIALMRESDMGQPFQYDNFRD 
h - EVMAMLREANLEQ  PLQYNNFRE 
n - SVMALVRGDDLGQPFQFKNFRD 
e - AVMAMVRGGDLGQPFQFKNFRD 
1 - AVMALVRAADLGQPFQFNNFR 
j - AVIQLVKGAELGQPFQFNNFRD 
g*-  AVIQLVKSAELGQPFQFNNFRD 
O - AVIAMLRKDVTDTPLDFNNFRD 


slliilssSSIsI . I . IsAias 
PPPeeeeee 


Figure  5-9.  (Continued) 

However, the helix from 016-034 is very good. A potential strand is located at
positions 035-039. This strand displays alternating periodicity, assuming that the
conserved R and D, at positions 035 and 037, are hydrophilic. The amphiphilicity of the
strand from 054-057 is adequate (IKIT). Another potential helix is anchored on segment
065-072. Amphiphilicity could be seen for the 064-073 segment and is broken by
position 074. This assignment is even more appealing since the breaks are on the same
side of the wheel.

The helices in this protein are intriguing. For example, in the region 070-090, two
helices can be placed with excellent amphiphilicity; they overlap, but the amphiphilicity
is not consistent over the whole region. The same is true at the beginning of the sequence,
at positions 010-027. This makes secondary structure difficult to assign, but is undoubtedly
revealing about the tertiary structure of the protein.




Figure  5-10.  The  helical  wheel  segments  of  T0148. 




The region from 095-110 is problematic. Helices can be built from 096-110, and
the amphiphilicity starts at ca. 092 if we ignore the SGG parsing string. The
amphiphilicity beginning at position 096 is broken by a strong interior assignment at
position 104. A helix from 103-110 appears to be a reasonable assignment as well. It is
interesting to note that the 097-105 segment showed alternating periodicity, and was
strongly assigned as a strand. This assignment assumes that the strand survives the lack
of a hydrophobic residue at position 100.

Another potential helix is observed from positions 113-122. Supporting this case is
the fact that this helix begins and ends on the same arc, and there is an inside/surface
ambiguity at the interface between the surface and interior arcs. A strand is assigned to
positions 129-133 (VQTQI). A weaker strand centers on the conserved R; positions 137-140
(VRVT). The last helix is secure in the sense that it begins and ends on the same arc.

Target T0148 displays all of the challenges associated with a prediction for a
family with relatively few sequences and relatively little overall sequence divergence.
Seventeen of the 170 positions in the alignment were the same in all proteins. With so little
divergence, it is difficult to find information that is present in more variable families,
information that is valuable for assigning surface positions in particular. Further, there is
little opportunity for gapping that might reveal parses.
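Counting the positions that are identical in all proteins is straightforward once the alignment is held as equal-length rows. The sketch below is illustrative only; the function name is hypothetical, and treating `_` and `-` as gap characters that disqualify a column is an assumption based on the symbols visible in the alignment figures.

```python
def fully_conserved_columns(alignment, gap_chars="_-"):
    """Return the 1-based columns where every row carries the same residue.

    Columns containing a gap character in any row are not counted.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of the alignment must have the same length")
    conserved = []
    for col in range(length):
        residues = {row[col] for row in alignment}
        if len(residues) == 1 and not (residues & set(gap_chars)):
            conserved.append(col + 1)
    return conserved
```

The ratio of `len(fully_conserved_columns(rows))` to the alignment length gives the fraction of invariant positions for a family.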

Thus, segment 009-040 has no effective parses. Only an SD at positions 014-015,
a single P at position 019, and a GxDN at positions 024-027 are potential parses. The
segment is littered with conserved functional residues. The case for a helix is made by
the pattern of amphiphilicity. This secondary structural element is anchored at positions
14-18, but requires that the APC D at position 011 be a surface residue.

Results  for  T0148 

T0148 is a hypothetical protein encoded by the gene YajQ of Haemophilus
influenzae. This protein was selected for X-ray crystallographic structure determination
and analysis to assist with the functional assignment for the CASP5 project. The amino
acid sequence has no homology to that of other proteins.


Figure  5-11.  Ribbon  representation  of  the  YajQ  monomer.  The  figure  on  the  right 
represents  what  we  did  right  (green)  and  what  we  missed  (red). 

The YajQ protein was cloned and expressed, and the crystal structure was determined at
2.1 Å resolution by applying the multiwavelength anomalous dispersion method to a mercury
derivative. The polypeptide chain is folded into two domains with identical folding
topology: a four-stranded antiparallel β sheet flanked on one side by two α helices. This
structural motif is a characteristic feature of many RNA-binding proteins. The tetrameric
structure observed in the crystal suggests a possibility of binding two stretches of
double-stranded nucleic acid.


10  20  30  40  50  60 

I I I I I I 

10  20  30  40  50  60  70 


I '••••! I , | , | | | 

c _ SFDISAALDKQELKNAFEQAKKELDSRYDLKGIKCEIDLSEKESIFKLSSSSEGKLDVLKDIVISK 

f _ SFDLVSEVNLQEVDNAINLAMKEITNRYDFKGSKSSIERTGDEQV_TLISDDEYKLESVIDILKSK 

a _ SFDIVSKVELPEVQNAIQIALKEISTRYDFKGSKSDISLDKEELV LVSDDEFKLSQLKDVLVSK 

d _ SFDIVSKVERQEVDNALNQAAKEISQRYDFKGVGASISWSG_EKI_LMEANSEDRVTAVLDVFQSK 

b - SFDIVSKVDRQEVDNALNQAAKELATRFDFRGTDTKIAWKGDEAV_ELTSSTEERVKAAVDVFKEK 

m _ SFDIVSDFDRQELVNAVDQVIRDLKSRYDLKDTQTTVEL_GEEKI_TIGTDSEFTLESVHNILREK 

i - MPSFDWSELDKHELTNAVDNAIKELDRRFDLKG_KCSFE_AKDKSVTLTAEADFMLEQMLDILRSN 

k - DIVSEVDTVELRNAVDNANRELSTRFDFRNVEAGFEL KDEWKL S AEDDFQLGQMMD I LRGN 

h - MPSFDIVSEIDAVELRNAVENSTRELASRFDFRNVDASFEL KEETVKLAAEDDFQLGQMMDILRGN 

n - MKGEEKMPSFDIVSEVDLQEARNGVDNAVREVESRFDFRGVEATIELNDANKTIKVLSESDFQVNQLLDILRAK 
e - MKGEEKMPSFDIVSEVDLQEARNAVDNASREVESRFDFRNVEASFELNDASKTIKVLSESDFQVNQLLDILRAK 

1 - MPSFDIVSEIDMQEVRNAVENATRDLANRWDFRNVPASFELNEKNESIKWSESDFQVEQLLDILRAQ 

j - MPSFDIVSEITLHEVRNAVENANRDLTNRWDFRNVQAAIELNEKNESIKVSSESDFQVEQLVDILRNA 

g*_  MPSFDIVSEITLHEVRNAVENANRVLSTRYDFRGVEAVIELNEKNETIKITTESDFQLEQLIEILIGS 

O - MPSFDWCEANMVELKNAVEQANKEI STRFDFKGSDARVE HKDQELTLFGDDDFKLGQVKDVLLTK 

•k  -k  * * ★ 


PDARia . aaai . aiAIi . sIssiAIsAilsiissSIsSAIAIsSiSiSIsisssSssisIsssssi . IsiliSIISSs 

DSSP  (A) 

DSSP  (B) 

TARG 


EXPT 

PRED 

Baker 

70 


80  90  100  110  120  130  140 

I I I I I I | 

C - LIKRGINPKAIK ELSRESGAMFRLNLKANDAIDSENAKKINKVIKDSKLK_VNSSIRGEEIRVAAKQIDDLQ 

f - FIKRGLSQKTMDFGKIERAAGGTVRQWTLLSGIEGERAKKLTKLIRDSKLK_VKAQIQNDQIRVTGKNIDDLQ 
a - LIKRNVPTKNIDYGKVENASGGTVRQRAKLVQGIDKDNAKKINTI IKNSGLK_VKSQVQDDQVRVTGKNKDDLQ 
d - LIKRGISLKALDAGE_PQLSGKEYKIFASIEEGISQENAKKVAKLIRDEGPKGVKAQVQGEELRVSSKSRDDLQ 
b - LIRRDISLKAFEAGE_PQASGKTYKVTGALKQGISSENAKKITKLIRDAGPKNVKTQIQGDEVRVTSKKRDDLQ 
m - AAKRNLSQKIFDFGKVESASGNRVRQEITLKKGISQDIAKQISKLIRDE_FKKVQASIQGDAVRVSAKAKDDLQ 
i - LVKRKVDSQCMEI_KDAYPSGKWKQDVNFREGIDKDLAKKIVGLIKERKLK_VQAAIQGEQVRVTGKKRDDLQ 
k - LAKRGVDAKAME_AKDSVHSGKRWFKDVQFKQGLDPLTSKKWKAIKDAKLK_VQASIQGEKVRVTGKKRDDLQ 
h - LAKRGVDARAMK_AKDSVHIGKNWYKEAEFKQGLEALLAKKIVKLIKDAKIK_VQASIQGDKVRVTGKKRDDLQ 
n - LLKRGIEGASLDVPDEFVHSGKTWYVEAKLKQGIESAVQKKIVKLIKDSKLK_VQAQIQGEEIRVTGKSRDDLQ 
e - LLKRGIEGSSLDVPENIVHSGKTWFVEAKLKQGIESATQKKIVKMIKDSKLK_VQAQIQGDEIRVTGKSRDDLQ 
1 - LSKRGIEGAALEIPEEMARSGKTYSVDAKLKQGIESVQAKKLVKLIKDSKLK_VQAQIQGEQVRVTGKARDDLQ 
j - CIKRGIDSGSLDIPTEYEHSGKTYSKEIKLKQGIASEMAKKITKLIKDSKLK_VQTQIQGEQVRVTGKSRDDLQ 
g*-  CIKRGIEHSSLDIPAESEHHGKLYSKEIKLKQGIETEMAKKITKLVKDSKIK_VQTQIQGEQVRVTGKSRDDLQ 
o - LAKRGVDVRFLDYQDKQKIGGDKMKQWKIKKGVSGELSKKIVKLXKDSKIK_VQGSIQGDAVRVSGAKRDDLQ 
* * * * * **  **** 


illllllsIiisiliillillisIISiliilsSisIilsIiiSsillillliiisilsIIIsIIIII 

illllllsIsIsilisIIiilisIIsIIiilsSIsIilsIisSsillillliissIIsIIIilllil 

PSFDIVSEITLHEVRNAVENANRVLSTRYDFRGVEAVIELNEKNETIKITTESDFQLEQLIEILIGS 


[Secondary structure cartoon with elements labeled by letter, and rows EXPT, PRED, and
Baker: helix (H) and strand (E) assignments; column positions not recoverable from the
extracted text.]

80  90  100  110  120  130  140

Figure  5-12.  The  alignment  for  Target  T0148.  The  sequence  for  which  the  crystal 
structure  is  unknown  is  g,  and  is  indicated  by  a *. 




IisAsISss . ISi . SSsssS . ssIssSisIss . ISssSiAsI . sIIssssiAsli . . IisSsIAI . . sssAAIA 
iiiSIsiillsiiSSiissSsiilliIsliillsssIIsillsilssiSiS  IsisiiSsilililSisSils 
IiiisIiiillsiiSSissiSSiillsIsIIillisSIIsillsilssiSiS  IsiiiiSiilililSiSSils 
CIKRGIEHSSLDIPAESEHHGKLYSKEIKLKQGIETEMAKKITKLVKDSKIK  VQTQIQGEQVRVTGKSRDDLQ 

[Secondary structure cartoon with elements G through O, and rows EXPT, PRED, and Baker:
helix (H) and strand (E) assignments; column positions not recoverable from the extracted
text.]

150  160

150  160  170

c - AVMKLVKELDLELNISFKN-- 
f - QVIQLVKEQDLDFPVQFVNMR 
a - QIISAVRGADLPIDVQFINFR 
d - TVI SLLKGQDFDFALQFVNYR 
b - AVIAMLKKADLDVALQFVNYR 
m - IVIQRLKQEDYPVALQFTNYR 
i - EAIALLRGESLGMPLQFTNFRD 
k - SVIALMRESDMGQPFQYDNFRD 
h - EVMAMLREANLEQPLQYNNFRE 
n - SVMALVRGDDLGQPFQFKNFRD 
e - AVMAMVRGGDLGQPFQFKNFRD 
1 - AVMALVRAADLGQPFQFNNFR 
j - AVIQLVKGAELGQPFQFNNFRD 
g*-  AVIQLVKSAELGQPFQFNNFRD 
O - AVIAMLRKDVTDTPLDFNNFRD 


slliilssSSIsI . I . IsAias 
illsilsSiSIillliliisiS 
illsilsSiSIslililiissS 

AVIQLVKSAELGQPFQFNNFRD 
HHHHHHHHH  EEEE 


HHHHHHHH  EEEEEE 

HHHHHHHH  EEEE 

HHHHHHHHH  EEEE 


Figure  5-12.  (Continued) 

The  parsing  tool  reproduced  the  inter-strand  junctions  perfectly.  This  was  true  even 
in  the  cases  where  the  secondary  structural  elements  predicted  between  parses  were 
predicted  incorrectly. 

The missed element H and the mispredicted element C both appear to reflect
erroneous assignment of secondary structure by the contest auto-judge. There are no
alpha helices that cover just three residues. Inspection of the crystal structure showed that
these elements were not in fact helices, but rather turns near the active site. This is an
example of the third class of error, based on the weaknesses of the method.

We  then  examined  the  crystal  structure  to  assess  the  role  of  the  two  missed  strands, 
labeled  I and  J,  in  the  fold.  These  are  two  edge  strands  that  are  hydrogen  bonded  to 
strand  K,  together  with  a loop  between  them.  They  are  not  core  elements  of  the  sheet,  but 
they  do  form  hydrogen  bonds  with  another  strand.  This  is  also  a difficulty  of  the  method 
and  is  an  example  of  class  2. 

Thus, the only mistake worthy of comment is the misprediction of element K as a
helix. This region is assigned as a strand by the auto-judge. Inspection of the crystal
structure shows that it is a core element. The mispredicted strand K represents an
interesting problem. As mentioned in the discussion of this segment, this region was
problematic. Despite the observation that the 097-105 segment showed alternating
periodicity, a long helix was predicted. This segment might have been predicted correctly
if more emphasis had been placed on the APC Gly (14 out of 15 positions) at position 107 in
Figure 5-12. However, it is encouraging that no other helices were ambiguous in the
prediction of T0148.

Returning  to  the  original  prediction  document,  it  was  clear  that  the  surface  and 
interior  assignments  fit,  but  a prediction  was  not  made.  The  surface  and  interior 
predictions  clearly  fit  on  a long  helical  wheel.  We  then  asked  whether  the  surface  and 
interior  assignments  were  themselves  correct.  The  automated  server  predicted  5 surface 
residues  and  3 interior  residues  for  element  K.  The  experimental  results  indicate  that 
there  are  only  2 surface  residues  and  6 interior  residues  for  Chain  B and  only  1 surface 
residue  and  7 interior  residues  for  Chain  A. 




Overall,  this  prediction  is  essentially  perfect.  The  principal  culprit  here  is  unit  K. 
We  assigned  this  segment  incorrectly  because  of  the  incorrect  prediction  of  the  surface 
and  interior  assignments.  The  experimental  assignment  from  the  two  DSSP  structures 
indicates  that  this  unit  is  indeed  a buried  strand. 

T0149 

Target T0149 was submitted by an unnamed crystallographer. The protein,
designated yjiA, is from Escherichia coli. The target protein is 318 amino acids in length.
The MasterCatalog contained the target in module 232016. The family was divided into
5 subfamilies based on the tree structure. These were independently submitted to the
Darwin server. The server generated the five trees shown in Figure 5-13, and the five
multiple sequence alignments are shown in Figure 5-14.
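The text states only that the family was divided into subfamilies based on the tree structure. As a rough illustration of how closely related sequences can be grouped, the sketch below uses single-linkage grouping on a simple p-distance (fraction of differing aligned positions). This is a stand-in technique under stated assumptions, not the MasterCatalog or Darwin procedure; the threshold, the function names, and the toy sequences are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of aligned, non-gap positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def split_into_subfamilies(sequences, threshold):
    """Single-linkage grouping: two sequences share a subfamily if a chain
    of pairwise distances, each no larger than `threshold`, connects them.

    `sequences` maps a label to an aligned sequence string.
    """
    names = list(sequences)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(sequences[a], sequences[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values(), key=lambda g: names.index(g[0]))


# Hypothetical toy fragments, loosely modeled on a P-loop motif.
toy = {
    "a": "GFLGSGKTT",
    "b": "GFLGSGKTS",
    "c": "GYLGAGKST",
    "d": "GYLGAGKSS",
}
print(split_into_subfamilies(toy, threshold=0.2))
```

Cutting an actual phylogenetic tree at chosen branches would normally replace the fixed distance threshold used here; the sketch only shows the grouping idea.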


Figure  5-13.  Phylogenetic  trees  for  the  5 subfamilies  of  T0149:  (a)  Subfamily  1,  (b) 
Subfamily  2,  (c)  Subfamily  3,  (d)  Subfamily  4,  and  (e)  Subfamily  5. 




Subfamily  1 

10  20  30  40  50  60  70 

I I I I I I I 

b - -VIWGGFLGSGKTTTVINMGKYLAEKGKKIAIIVNEIGEIGIDGDIIKKFGFDTKEITSGCICCTL 

a - -VIWGGFLGSGKTTTIINMGKYLAEKGKKVAIIVNEIGEVGIDGDVINRFGFDTKEITSGCICCTL 

MKCMIIGGFLGSGKTTTIRKLVEYLGAQGQRTAIIVNEIGEVGIDGETISGEGVETREITSGCICCTL 
MKCMIIGGFLGSGKTTTIRKLVEYLGARGQRTAIIVNEIGEIGIDGDIISGGEVETREITSGCICCTL 

- EKGCANMRCIILGGFLGSGKTTTIRKLVETPGIKGEKTAVIVNEIGEIGIDGDTVSAGGVETREITSGCICCTL 
MKVMIIGGFLGSGKTTAIQKISRQLSDAGKRIAIIVNEIGEIGLDGDTLKSPDIKTEELTSGCICCTL 

aa.a.aisIIII. .II . A. AAAil .slisiiiiS. ssiHIIAAI .AI . IA. SSISSSSIsAsAIAA. AIAAAI 

80  90  100  110  120  130  140 

I-.-- I I I I I I 

- KMGLRTTVDTLMKEYKPDILMIEPTGIAFPNSIKNEFKLMNLGEEVKVAPLVTLIDGSRFKHLMKEIKDFAMRQ 

- KVGLRTTMDLLTKEYKPDIVMIEPTGIAFPNVIRTEVELMNLGEEVKVAPLVTLIDGSRFKNLMKEVKEFAMRQ 

- RISMEYTLRNLMASYHPDTIIVEPTGIAFPRQIKSNIESMGL_PEISFTPIVNLVDPSSLSPNVEDLQNFVRNQ 

- RISMEQTLRNLMLTYSPDTIIIEPTGIAFPRQIKSNIESMDL_PEISITPIVNLVDPSRLSPNVEDLQNFVRNQ 

- RISLEQTLRSLMQDYRPETVIIEPTGIAFPKQIKDNILAMNI_PGISMAPIVNILDASRFKPE_ESLQNFIKYQ 

- KISMQYTLQTLEEEFRPEIVIIEPTGIAFPGQIREEIEIMGL_SELSFAPIVTIVDPGRFGTEAKEIPRFIETQ 

* ★ ★ ★****★★*  ★ * ^ .*.*...* * 

sl . IsSAIssIs . SIS. sill IA. A. III. silSSSISi ISI.ssIsIi.il. I IA. . SISSSissI . SIIssA 

150  160  170  180  190  200  210  220 

•I I I I I I I I-- 

- IIDAEILGINKVDLIDSLQIPILEVSVQQLNPTAKWL.LSGKDTGEKFENFIQLVL PGVKDIPN 

- IIDAEILGINKVDLIEPIRIPIIEASVQQLNPKAKWLLSGKDTGERFENFMQLVL PEIKENQE 

- KEDAEILGINKVDLIGQEKLLETCLLLRKLNPRARIVHFSARQGGEGLDKFFRLAG VSSQ 

- IEDAEILGINKVDLIGQEKLLETCLLLRKLNPRARIVHFSARQGGEGLDKFFGLIG VSNK 

- IEDAEVLCINKVDLIDRVSLLEICVFLRKMNPKARIIHFSAREDGEIEELVEILFGGNTGNIKKSTISKSSKRG 

- VKEAEILGINKVDIAPAEIVMATEQMLTELNPEAKILKFSAKLGDE 

******** **.*...  .* 

S . sIAII . IAAIAIISSSSIi . iiiil . . IA. SIsII . IAiSSSsaS . asissiiS . aa . ai . . i . iSSSSSSS 

230  240  250  260  270  280  290 

I I I I I I I 

- KTKKTEEETEPTQQDFENSNDTCGVDGSEENSSNVPYNHGDSFKHT_VGSYSVEYSLKGGRDLSEETARIITTE 

- KTQVTTQKEKPVPSEVQETEDSIKASG VGSYAAEFEIKDGGSLSTEAARNLTAE 

-  DKTACE MRNSVKAKN SIEARNSVEMSGVSAYSTEFEITS_QEISLETAVSVSGQ 

- DKTASE VRNSVKAKN P I EERN SVEMSGV  SAY  STE  F E I TS_Q  E I PLETAV  SVSGQ 

C - OMESSASERTP  PYDRIPVACRVOVGEKI PLEGRN SVAMSGV SAY S S EF EVF S_DSMTLDDA I FVSTE 

f - QFENLLAHLSVPGTAKP PQEKQNSIEISEVSAHSVLYTFKV_CSLNPEKGRFLIEE 


s . SS . SSSSS . SSSSiSSSSSSISSSiSSSsaaa . . iSSSsaisissI . ii . . sISI . S . SSISSSsiisI . ss 
300  310  320  330  340  350  360  370 


b - IMNKIKEKALKLNPEFLGHIKLFLDNGSETVKQSITIYNEEPQEDLFKSNDGATPTFKVLSAVSKVEKEVLKNA 
a - LMNTIKEKVLKLNPEFVGHIKLFLDNGLETVKQSVTIYYEEPQEEIIKSKEGATPTFKILSAVSNVEKEALKDS 
e - ILDSIRKRVTQLNPEFTGHIKLSFAHKENFVKGSVTSAHEKPEIEILKKEKNSRSRLKIISAVASVPREELIRT 
d - ILDSIRTRMTQLNPEFTGHIKVSFAHKENFVKGSVTSAHEKPEIEILKKEKNPSSRLKILSAVTSVPREELIRT 

C - LLESIKSRTRSLNPDFTGHIKLSFSHLDTLIRGSVTSASSSPEIEIFKRGSDTGSKMKVFSAITDVSQKAL 

f - SLQTIRDRVEEMNPEFIGHIKLSLKLPDFMVKGSVTSSEETPQVEFIPR_KNEKFELRLLSAITKVPKDRLRNI 

* ..  . * * . * * * * * *.*.  . * 

ilsSISSSissIA. sli . AIAIil . iSSSils . AIAiiSSS . . iSIISSSSs . S . sIsHAIISSISSSsI . S . 


Figure 5-14. Multiple sequence alignments for the subfamilies of T0149. The five multiple
sequence alignments are labeled (a) - (e), respectively.


I I I I

b - VSSSVQETFEKMGIEIQQIKHSHDHEH_HHGHDHHQGHDHHQ 
a - VNSSVHETFEKKGMDVHKVENGHGHGHEHHGHEHHE_HEQHE 

e - VDTTASGKLEDRQLSYIKAEKSN 

d - VDTTASGQLEDRKLSYRKVEKSN 

C 

f - VESTLEAKLKESDISFEKKE 


is . . iSSSiSSSSiSiSS . Ss . . SaSaaaa . aSaaS . aSsaS 


Subfamily  2 


10  20  30  40  50  60  70

e - VPVTILTGFLGAGKTSLLRSILENRNGKRVAVL
d - VSVITGYLGAGKSTLVNYILNGKHGKRIAVI
l - IPVTVLTGYLGAGKTTLLNRILTEEHGKRYAVI
m - TPVTVLTGYLGAGKTTLLNRILSEPHGKKFAVI
n - IPVTVLTGYLGSGKTTLLNRILTENHGKRYAVI
p - IPVTVLTGYLGAGKTTLLNRILSENHGKKYAVI
o - IPVTVLTGYLGAGKTTLLNRILSENHGKKYAVI
q - IPVTVLTGYLGSGKTTLLNRILSENHGRKYAVI
r - IPVTVLTGYLGAGKTTLLNRILSENHGRRYAVI
k - VPVTVLTGYLGAGKTTLLNHILTYEHGKKVAVI
c - RIPATIITGFLGSGKTTLLNHILTGDHGKRIAVI
b - RIPATIITGFLGSGKTTLLNHILTRDHGKRIAVI
a - KIPVTILTGFLGAGKTTLLNHILTERHGHRIAVI

i - MLPAVGSADEEEDPAEEDCPELVPIETTQSEEEEKSGLGAKIPVTIITGYLGAGKTTLLNYILTEQHSKRVAVI
h - MLPAVGSADEEEDPAEEDCPELVPMETTQSEEEEKSGLGAKIPVTIITGYLGAGKTTLLNYILTEQHSKRVAVI
g - MLPAMKTVEAEEEYAE_DCPELVPIETKHQEKEENLDFIIKIPVTIVTGYLGAGKTTLLNYILTEQHNRKIAVI
f - MLPAVKWEAEEEYAE_DCPELVPIETKNQE_EENLDFITKIPVTIVTGYLGAGKTTLLNYILTEQHNRKIAVI

j - RIPVSIITGYLGSGKSTLLEKIALKGADKKIAVI

★ ★ -k  k k 

ii  . iississaasiiaaaa.aii . iaaSSSasaasisiiSsi . IillA. II . i .Aiills . IlSSSiSSSIIII 

80  90  100  110  120  130  140 


I l I l l ••••<•••• I I ••••'••  • 

e - MNEVGDSGDLERSLMEDVGGEELYEEWVALSNGCMCC TVKDNGIKALEKIMR_QKGRFDNIVIETTGIANPG 

d - LNEFGEEIGVERAMINEGEEGAIVEEWELANGCVCC_TVKHSLVQALEQLVQ_RKDRLDHILLETTGLANPA 
1 - VMEFG  EVGIDNDLW  GAD  EEVFEMMNGCVCC  TVRGDLIRVLOGLMK  RKOGFDATTVF.TTGT.AnPG 

m - VNEFG_EVGIDNDLIV DAD EEVFEMNNGC ICC TVRGDL I RI I EALMR_RRERFDG I L I ETTGLADPA 

n - VNEFG_EIGIDNDLIV ESD E E I YEMNNGC IC C__TVRGDL I RWEGLMR_RPGRFDAI WETTGLAD PV 

p - VNEFG_EIGIDNDLIV ESD EEIYEMNNGCVCC  TVRGDL  I RWEGLMR  RPGRFDGTTVETTGT.ADPV 

0 - VNEFG_EIGIDNDLIV ESD EEIYEMNNGCVCC  TVRGDL  I RWEGLMR  RPGRFDGT  TVETTGT  ,ADPV 

g - VNEFG_EIGIDNDLIV ESD EEIYEMNNGCVCC  TVRGDL  I RWEGLMR  RPGRFDATTVETTGT.ADPV 

r - VNEFG_EIGIDNDLIV ESD EEIYEMNNGCVCC  TVRGDL  I RWEGLMR  RPGRFDAIWETTGT  ,ADPV 

k - VNEFG_EVGIDNQLVI DAD  EEIFEMNNGCICC TVRGDLIRI I SNLMK_RRDKFDHLVI ETTGLADPA 

C - ENEFG  EVDIDGSLVAAOTAGA  EDIMMLNNGCLCC TVRGDLVRMI S EMVQTKKGRFDHIVI ETTGLANPA 

b - ENEFG_EVDIDGSLVASKSIGA EDIVMLNNGCLCC TVRGDLVRMIGELVNTKKGKFDHIVIETTGLANPA 

a - ENEFG_EVDVDSDLVLA SE EEIYQMKNGCICCFVDVRNDLIEVLQKLL_ARKDKFDHILVETSGLADPT 

1 - LNEFGEGSALEKSLAVSQG_GELYEEWLELRNGCLCC SVKDSGLRAIENLMQ_KKGKFDYILLETTGLADPG 

h - LNEFGEGSALEKSLAVSQG_GELYEEWLELRNGCLCC SVKDSGLRAIENLMQ_KKGKFDYILLETTGLADPG 

g - LNEFGEGNAVEKSLAVSQG_GELYEERLELRNGCLCC SVKDNGLKA I ENLMQ_KKGKFDY I LLETTGLADPG 

f - LNEFGEGSAVEKSLAVSQG_GELYEEWLELRNGCLCC SVKDSGLRAIENLMQ_KKGKFDYILLETTGLADPG 

j - LNEFGDSSEIEKAMTIKNGSNS_YQEWLDLGNGCLCC SLKNIGVKAIEDMVERSPGKIDYILLETSGIADPA 

* k k * k k k k . k k *•  **  * * * 

iAAI . SSS . ISSSIIiSSSSSSiiSSSIsIsA. AIAA. . USSSilSIIssIIs . SSSsIA. IIIAAi . IIs . i 




150  160  170  180  190  200  210  220 

• I I I I I I I I • • 

e - PLAQTFWLDDALKSDVKLDGIVT VIDCKNIDNILKDESDI GFI QISHADCLILN 

d - PLASILWLDDQLESEVKLDC IVTLLEQWDAKNLRFQLNERRDS SSFPEAFNQIAFADTIIMN 

1 - PVAQTFFVDDDVKARTALDSVTAV VDAKHILLRLSDSK EAVEQ IAFADQIVLN 

m - PVAQTFFVDEDVRSKTRLDSI ITV VDAKHLLGEIDRAH EAQEQLAFADTIILN 

n - PVAQTFFMDDDVRAKTGLDAWAL VDAKHLPLRLKDSR EAEDQIAFADWLIN 

p - PVAQTFFMDDDVRAKTELDAWAL VDAKHLPLRLKDSR EAEDQIAFADWWN 

0 - PVAQTFFMDDDVRAKTELDAWAL VDAKHLPLRLKDSR EAEDQIAFADWWN 

q - PVAQTFFMDDDVRAKTELDAWAL VDAKHLPLRLKDSR EAEDQIAFADWLLN 

r - PVAQTFFMDDDVRSKTKLDAWAL VDAKHLPLRLKDSK EAEDQ IAFADVWLN 

k - PVIQTFFVDEDMQSQLSLDAWTL VDAKHIWQHW_DAD EAQEQ IAFADVILLN 

c - P I IQTFYAEDE I FNDVKLDGWTL VDAKHARLHLDEVKPE GYVNEAVEQ I AYADRI IVN 

b - PIIQTFYAEEEIFNDVKLDGWTL VDAKHARLHLDEVKPE GWNEAVEQ I AYADRI  IVN 

a - PVATAFF IDDE IGKHVTLDGIVTL VDAKHIGQHIDDPVLD GRDNQAVDQIVAADRIIIN 

1 - AVASMFWVDAELGSDIYLDGI IT IVDSKYGLKHLTEEKPD GL INEATRQVALADAIL IN 

h - AVASMFWVDAELGSDIYLDGI  IT IVDSKYGLKHLTEEKPD GLINEATRQVALADAILIN 

g - AVASMFWVDAELGSDIYLDGI  IT__WDSKYGLKHLTEEKPD GLVNEATRQVALADMILIN 

f - AVASMFWVDAELGSDIYLDGI  IT WDSKYGLKHLTEEKPD GLVNEATRQVALADMILIN 

j - PIAKMFWQDEGLNSSVYIDGIIT__VLDCEHILKCLDDISIDAHWHGDKVGLEGNLTIAHFQLAMADRIIMN 

* * ★ ★ ★ * 

illillllSSSIs . sIsIAillli . aailAisii .ssISSSSSS.a.a. aa . . .a. . . ssissAIillA. IIIA 


230  240  250  260  270  280  290 

I I I I I I I 

e - KTDLI SSEALSWRQTILKINCLAKIIETTYGRLD D ISEILDLD 

d - KVDLI SQEESDELEKEIHSINSLANVIRS 

1 - KTDLV  SEDDLRHVEARIRRINPLAPIHRAQRSNVPLDAILGKHSFD LERITDLEPDFL_NPA___ 

m - KTDLV  SPEGLOAVEDRIRRINPTAGILKTQRCNLEIASLLDRNAFD LDRILEVEPDFL_E_A 

n - KTDLV  TPEELAAVEATVRAINPHAI IHRTERASI PLDRVLDRGAFD LKRVLDNDPHFL_D 

p - KTDLV  TPEEVARIEDIVRAINPSARIYKTTRSGVDLARVLDQGAFN LERALENDPHFL_E 

0 - KTDLV  TPEEVARIEDIVRAINPSARIYKTTRSGVDLARVLDQGAFN LERALENDPHFL_E 

q - KTDLV  TPEELERVEATVRVINPSARIYRTQRSEIDLGKVLDRGAFN LDRALENDPHFLJD 

r - KTDLV_TPE ELAKVEAT I RA I NPAAK I HRTTRAGVAL S EVLDRGAF D LSRALENDPHFL_E_G 

k - KTDLV__TPS ELDELEKRI RSMNAI AK I YRTRN S ELAMDALLGVKAF D LDRALEIDPNFLGEDA 

c - KTDLV  GEPELASVMORIKTINSMAHMKRTKYGKVDLDYVLGIGGFD LERI ESSVNEEEKEDR 

b - KTDLV  GEABLGSWORIKTINSMAOMTRTKYGNVDLDYVLGIGGFD LERI ES SVNEDDKGD_ 

a - KIDLV  SEGEIAPLERGMRKLNQTAEIVRSSYGRVDLSSILGISGFA PSYVAERAK 

1 - KTDLV  PEEDVKKLRTTIRSINGLGOILETQRSRVDLSNVLDLHAFDSLSGISLQKKLQ 

h - KTDLV_PE EDVKKLRTT IRS INGLGQ I LETQR ISLQKKLQ 

g - KTDLV_SEEELNKLRTTIRSINGLGKVLETQRSRTHLSNILDLHAYDTLSGISLQKKLQ 

f - KTDLV  SEEELNNLRTTIRSINGLGKVLETQRSRVHLSNILDLHAYDILSGISLQKKLQ 

j - KYDTI EHSPEMVKQLKERVRE INS I APMFFTKY SDTPI QNLLDI HAYDSV 

★ * * 

AIAIIaa. SSSiSSISSSIssIAssiSIsS . Ss . SSSisSiiSSSiisSia . iaiSsiiSSSssiiSSSsasaa 




300  310  320  330  340  350  360  370 


l - HGEPGHV  HDEHCDH  HHHHDHDHVHDEHCGHDHHHHHHDHKSDVHDDGVKGI
m - DHDHEHD  DHVA  SF
n - HDDPDHVCGPDCDHDHHH  HGHDHHHHDH  HDHDHVCGPDCDHDH  DH  ASPIHDVTVKSV
p - HGHEDHACGPDCDH  HHHDHGHDHGHDHHHHDHD  HGHHDHAH  DHGH  DHHHGAVSPIHDVTVQSV
o - HGHEDHACGPDCDH  HHHDHGHDHGHDHHHHDHD  HGHHDHAH  DHGH  DHHHGAVSPIHDVTVQSV
q - QEDHDHVCGPDCDHDHHH  HDHHHHDHD  H DHDHDH  DHHH  HHHHDGPSPIHDVTVQSI
r - HDDHDH  HDHDHHDHDGHHHHDHD  GHDH  HHHDHPSDIHDVTVQSV
k - HE  HDDTVFSV
c - EGHDDHHH  GHDC  HDHHNEHEHE  HF.HF.HH  HSHDH  THDPGVGSV
b - HH  DHD  HDHHHDHNHD  HDHHHHD  GHDHHHHSHDH  THDPGVSSV
a - LLD  LDHHHH  HDHHHH  HHH  LHDATVSSE
i - HVPGT  OPHLDO  SIV
h - HVPGT  OPHLDO  SIV
g - HVS  T  APHLDO  SIV
f - HVS  T  APHLDO  SIV
j -

a . SSSSsaiis . aaaaaSSSsas . sa . . saSaaaaa. SSSia . Sa . SSSsasaSaaSa . aSSSasiaa . sisii 
380  390  400  410  420  430  440 

I I I I I I I-... 

1 - SLTLDK PVDGQKITAWLNDLLARRGP DILRAKGIIDVKGEDKRLVFQAVHMILEGDFQRP_WTD 

m - SLVERR PVDPEKFFRWLQTTARAFGT DMLRMKGI  IAFAGDTDRYWQGVHMLVEGDHQRP_WKE 

n - SL RTGEIDPAKFFPWIQNITQTQGP NILRLKGI IAFKDDPDRYWQGVHMI IEGDHQRA_WKP 

p - SL RGGEMNPERFFPWIQKVTQTDGP NILRLKGI  I AFKGDAERYWQGVHMI I EGDHQRP_WKE 

0 - SL  RGGEMNPERFFPWIQKVTQTDGP NILRLKGI  I AFKGDAERYWQGVHMI  I EGDHQRP_WKE 

q - SL  RGGEMNPDRFFPWIOKITOTDGP NILRLKGI IAFAGDAERYWQGVHMI IEGDHQRP_WKD 

r - SL  RGGEMDPKKFFPWIEKITOMEGP NILRLKGI IALKGDDERYVI QGVHMI I EGDHQRA_WKD 

k - ALVQE GELDGEKLNAWISELLRTQGT DIFRMKGILNIAGEDNRFVFQGVHMIFDGRPDRL_WKP 

c - SIVCE GDLDLEKANMWLGALLYQRSE DIYRMKGILSVQDMDERFVFQGVHEIFEGSPDRL_WRK 

b - SIVCE GSLDLEKANMWLGTLLMERSE DIYRMKGLLSVHTMEERFVFQGVHDIFQGSPDRL_WGR 

a - SFVFD RPFDQGRLTDYLAGLLREEGD DIFRTKGIIAIAGDPRFFVLQAVHKLMDFRPDHV_WGK 

1 - TITFEVPGNAKEEHLNMFIQNLLWEKNVRNKDNHCMEVIRLKGLVSIKDKSQQVIVQGVHELYDLEETPVSWKD 
h - TITFEVPGNAKEEHLNMFIQNLLWEKNVRNKDNHCMEVIRLKGLVSIKDKSQQVIVQGVHELYDLEETPVSWKD 
g - TVTFDVPGSAEEESLNVFIQNLLWEKNVKNKDGRCMEVIRLKGLVSIKDKPQQMIVQGIHELYELEESRVNWKD 
f - TVTFEVPGSAKEECLNVFIQNLLWEKNVKNKDGHCMEVIRLKGLVSIKDKPQQMIVQGIHELYDLEESLVNWKD 


. i . . S. SSSiS . SSissiissiiSSSSSSaaasSaisiiaia . ii . iSSSSs . iiia. ia . iiSis . sSisisS 
450  460  470 


1 - KDKRYSRMVFIGRDLDEAELRAGFEA 
m - GEERVSRLVFIGRNLPKDVITDGFMACCA 

n - EEKHESRLVFIGRELD 

p - DEKRESRLVFIGRELDREKLENSFKACLATA 

0 - DEKRESRLVF IGRELDREKLENSFKACLATA 
q - GEKRESRLVFIGRDLDREKIERTFKAC 

r - GEKHESRLVFIGRELDAERLKKSFDAC 
k - NEKRKNELVFIGRNLDEAQLKQDFLACFA 
C - DETRTNKIVFIGKNLNREELEMGFRACL- 
b - EEERVNKIVFIGKNLNREELEKGFKACL- 
a - DMPYT_KLVFIGRNLDRAALERGLECCL- 

1 - DTERTNRLVLLGRNLDKDILKQ 

h - DTERTNRLVLLGRNLDKDILKQ 

g - DAERACQLVFIGKNLDKDILQQ 

f - DAERACQLVFIGRNLDKDVLQQ 


SSSSSSsiiii . SSiSSSsiSSSiSiaiiai 

(b) 




Subfamily  3 


f 


10  20  30  40  50  60  70 


VPVTLLTGFLGAGKTTLLNELLAAPAMRDAAV 

VPVTLLTGFLGAGKTTLLNELLAAPAMRDAAV 

- MSLVASSHGRFLFVPCGSRSCLALCHAGGQPSPYRQGMSRDRIPVSIITGFLGAGKSTLLNRLLKDPEMSDAAI

- MSLVASSHGRFLFVPCGSRSCLALCHAGGQPSPYRQGMSRDRIPVSIITGFLGAGKSTLLNRLLKDPEMSDAAI 

VPVSILTGFLGAGKTTLLNRVLKDPALADTAV 
DP I PVSVLTGFLGSGKTTLLNRLLKDP ALSDTAV 
PIPVSVLTGFLGAGKTTLLNRLLKDPVLADTAV 

IPITLITGFLGSGKTSFLSEYLNQIDHQGVAL 

IPITLITGFLGSGKTSFLSEYLNQIDHQGVAL 

IPITLI TGFLGSGKTSFLSEYLNQTDHQGVAL 

IPITLITGFLGSGKTSFLSEYLNQTDHQGVAL 

IPVTILTGFLGSGKTTLLKRILTEQHGMKIAV 

IPVTVLTGFLGAGKTTLLKYLLQAEHGMKIAV 

— P I AVTLLTGFLGAGKTTLLRH I LNEQHGFK I AV 


MNP I AVTLLTGFLGAGKTTLLRH I LNEQHGYK I AV 


— PIAVTLLTGFLGAGKTTLLRHILNEQHGFKIAV 


iaiiiaaa . aiiii . a . aaaaiiiaai . . a . a. iaa. iaaasi .iiiia.ii.i.a. iiissiiSSSsi . siii 


80  90  100  110  120  130  140 

I I I I I I I 

- IVNEFGSVPIDH GLVRTGSERYFHTTTGCICCTASSDIRTSL_YELHQVSLEGVMPEISRWIETTGLADPA 

- IVNEFGSVPIDH GLVRTGSERYFHTTTGCICCTASSDIRTSL_YELHQVSLEGVMPEISRWIETTGLADPA 

- IINEFGDVSIDH_MLVESAGEGIIELAEGCLCCTVRGELVDTL_AELIDGIQTGKLKPVKRWIETTGLADPA 

- IINEFGDVSIDH_MLVESAGEGIIELAEGCLCCTVRGELVDTL_AELIDGIQTGKLKPVKRWIETTGLADPA 

- IINEFGDVSIDH LLVEASSDGVIELSDGCLCCTVRGELVDTL_ADLMDRMQTGRIKPLKRWIETTGLADPA 

- I INEFGEVS IDH LLVEQASEGVI ELADGCLCCTVRGELVDTL_ADLIDRLQTGRIKALKRVI IETTGLADPA 

- IINEYGEVAIDH LLVEQASDGIIQLSDGCLCCTVRGELVDTL_ADLVDRLQTGRIARLARVIVETTGLADPA 

- IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWRGEI LKRVI I ETTGLANPA 

- IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWRGEI LKRVI I ETTGLANPA 

- IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWHGEI LRRI I I ETTGLANPA 

- IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWHGEI LRRI I I ETTGLANPA 

- IENEFGEENIDNDIL_VQDGNEQIVQMSNGCICCTIRGDLVAAL_SDLLTRRDKGEL_QFDRWIETTGVANPG 

- IENEYSETPID_GQL_LGVEPVQVMTLSNGCVCCSINTDLEKAL_FLLLERRDNGEI_DFDRLVIECTGLADPA 

- IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCTRSNELEDAL_LDLLDSRDRGDI_AFDRLVIECTGMADPG 

- IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG 

- IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG 

- IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG 

- IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG 

- IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCTRSNELEDAL_LDLLDSRDRGDI_AFDRLVIECTGMADPG 

* . * * * * * . * * . *...*.**.*  * 

IiAAI . si . IAisiiilss . isilisl . s . AIAAS . s . sIssilaisliSSSiS . SIsSISAIIIAIA. IIs . i 




150  160  170  180  190  200  210  220 

•I I I I I I I |.. 

f - PIINSLIAGGTPALGLRDHIVARHFHLSGWTVFDVISGRAMLDSFIEGWKQLAFADHIVLTKTDLADA A 

e - PIINSLIAGGTPALGLRDHIVARHFHLSGWTVFDVISGRAMLDSFIEGWKQLAFADHIVLTKTDLADA A 

h - PVMQSVMGNP ,VIAQNFVLNGMITWDAVNGLSTLDNHEEAVKQAAVADRLVMTKRTLADASA_IA 

g - PVMQSVMGNP VIAQNFVLNGMITWDAVNGLSTLDNHEEAVKQAAVADRLVMTKRTLADASA_IA 

i - PVLQSVLGNP VIAQNFRLDGWTWDAVNGEQTIANHVEAMKQVAVADRLVISKSGLAKDDG_VG 

j - PVLHS IMGHP LLTQVFRLDGVLATVDAVNGMATLDNHEEAVKQAAMADRIILTKTDLPEAQAGLP 

k - PVLQSIMAHP ALVQAFRLDGVIALVDAVNGNATLDAHVEAVKQVAVADRIVLSKVDLVTDTGDLD 

d - PILWTNLSDV FLGAHFEIQSWACVDVLNAKTHLTNN_EAKEQ IVFADSVLLTKTDLQNDSTALM 

c - PILWTNLSDV FLGAHFEIQSWACVDVLNAKTHLTNN_EAKEQ IVFADSVLLTKTDLQNDSTALM 

b - PILWTILSDT FLGVHFEIQSWACVDALNAREHLTNN_EAKEQIVFADSVLLTKTDLQNDSAALT 

a - PILWTILSDT FLGVHFEIQSWACVDALNAREHLTNN_EAKEQ IVFADSVLLTKTDLQNDSAALT 

1 - PVAQTF FMDD EIAARYLLDAVITLVDAKHGNQQLDRQEEAQRQVGFADAIFITKGDLVSD_DDVE 

m - PVAQTF  FADE ELCQRYVLDGIITLVDAANAERHLQE_TIAQAQVGFADRILVSKTDLVDA_ATFE 

O - PIIQTFFYHD VLCERYLLDGVIALVDAVHANEOMNOFTIAOSOIGYADRILLTKTDV  A GDSE 

s - PIIQTFFSHE ILCQRYLLDGVIALVDAVHADEQMNOFTIAOSOVGYADRILLTKTDV  A GEAE 

r - PIIQTFFSHE ILCQRYLLDGVIALVDAVHADEOMNOFTIAOSOVGYADRILLTKTDV  A GEAE 

- PIIQTFFSHE VLCQRYLLDGVIALVDAVHADEOMNOFTIAOSOVGYADRILLTKTDV  A GEAE 

- PIIQTFFSHE VLCQRYLLDGVIALVDAVHADEOMNOFTIAOSOVGYADRILLTKTDV  A GEAE 

- PIIQTFFSHD VLCERYLLDGVIALVDAVHANEOMNOFTIAOSOIGYADRILLTKTDV  A GDSE 

* * k ^ k k * 

. II . iSii  . .s.ii.  is  . S.iiSslils  . IIIIIAIisiSSIIss  . siiisAIillAsIIIsAssI . SSSSSiS 


230  240  250  260  270  280  290 

I I I I I I I 

- AIDQKSLTDLNPAAHLHDRNAADFDLMSLFSTKDYSLHGKTEDVPGWLAADRLSL HA GHD 

- AIDOKSLTDLNPAAHLHDRNAADFDLMSLFSTKDYSLHGKTEDVPGWLAADRLSL  HA GHD 

- ALTAR  LEALNPRATIEDGDKADWSAAGLLDNGLYDVSSKNADVGRWLGEES  GHGH 

- ALTAR  LEALNPRATIEDGDKADWSAAGLLDNGLYDVSSKNADVGRWLGBES  GHGH 

- RLEAR  LRDLNPRAPIIDGDTEEAGRADLFACGLYDATTKVADVGRWLODEAHAD  GHGHV 

- ALKAR_LRTLNPGADILEAGEERTGYAALFECGLYNPQTKSADVRRWLKAEAYED EHHGHVCGPDCGHDHH 

- TLRAR_LRQINPGAELLDAGHLTTGVAALFDCGLYNPATKSADVRRWLGEEAAHD HAHGH HHDGH 

- KLKER_IQSLNPSAEIFDKKNIDYE SFFSRKNGARNFMLRMPKDSHSQ 

- KLKER_IQSLNPSAEIFDKKNIDYE SFFSRKNGARNFMLRMPKDSHSQ 

- KLKER_IQALNPSAEIFDKRAIDYE SLFSRKNRARNFMPRMPKDSHSQ 

- KLKER_IQALNPSAE I FDKRAI DYE SLFSRKNRARNFMPRMPKDSHSQ 

- ALRHRLLH_MNPRAPIRTANFGEAPIDTIFDLRGFNLNAKLEIDPDFLREDDHDHDHGHEHNHACSPDCDHD 

- ALGQRLQR_INRRALVHWEHGRIDLAHLLDVRGFNLNADL 

- KLRERLAR_INARAPVYTWHGDIDLSQLFNTSGFMLEENV 

- KLRERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 

- KLRERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 

- KLHERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 

- KLHERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 

- KLRERLAR_INARAPVYTWHGDIDLSQLFNTSGFMLEENV 

•k  k 

sIS. silssIA. slsIssiSSSsIsiiSiisiS . IsiSSSi . si . . iiSSS . . s . aa . ssasa . a . . aasaasa 




300  310  320  330  340  350  360  370 


f - HDINRHGAGVEAFDLTFDGALDPQAIVTFLELVTSNVHSGLLRLKG 

- HDINRHGAGVEAFDLTFDGALDPQAIVTFLELVTSNVHSGLLRLKG 

- _HDHDHDHDHAHG HHHHHDVNRHDAS I RS F S IIHDQ P IDPMAIDMFVDLLRSAHGEKLLRMKA 

- _HDHDHDHDHAHG HHHHHDVNRHDASIRSFSIIHDQPIDPMAIDMFVDLLRSAHGEKLLRMKA 

- _HDHHHEDGHGHGH HHHHHDVNRHGSD IRSFS IVHDRP I EPMALEMF IDLLRSAHGEKLLRMKA 

- HHDHHHDHHNDQGY AHHHH HDDAIRSFSLRHDAPIPVSTFEMFLDLLRSTHGEKLLRMKG 

- DHGHDHDHSHDHG HR HDSRVRSY SLVHDGPVPF SAX  EMFLDLLRSTHGEKLLRMKG 

- GFETLSVSFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

- GFETLSVSFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

- GFETLSINFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

- GFETLSINFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

- _HEHGHDHHHGHG HHHHHGHAHHTDRIASFVFRSERPFNYTKLEEFLSGVLNVYGEKLLRYKG 

- GPGIGLRPLRAVAAKDSRDRIGTLVLRSDTPLDLERLSEFMDDLLQWHGNSLLRYKG 

- LASQ  PRFHF I ADKQNDVS  S I WELDYPVDI S EVSRVMENLLLESADKLLRYKG 

- VSTKPRFHF IADKQND I S S I WELDYPVDI SEVSRVMENLLLESADKLLRYKG 

- .VSTKPRFHFI ADKQNDI SS I WELDYPVDI  SEVSRVMENLLLESADKLLRYKG 

- .VSTKPRFHF IADKQNDISSIWELDYPVDISEVSRVMENLLLESADKLLRYKG 

- VSTKPRFHF  IADKQND  I S S I WELD  YPVD I S EV  S RVMENLLL  E S ADKLLRYKG 

- LASQ  PRFHF  I ADKQNDVS  S I WELDYPVDI  SEVSRVMENLLLESADKLLRYKG 

sasasaSSSSSS . s . .a. .a.i.SS. . isiiisiSSSIs . IilSissi Isisilsillsili . siissIIAIAi 


380  390  400  410  420  430  440 


- I FGATDDPERPWAHAVQHRLYPL I RL ESWPD_GDRSTRIVMIGMDVPQQPIRDLFNALAAQAG 

- I FGATDDPERPWAHAVQHRLYPL I RL ESWPD_GDRSTRIVMIGMDVPQQPIRDLFNALAAQAG 

- IVKLSDNPGRPLVLHGVQNIFHTPERL AAWPDPTDQRTRMVLITKDLPEAFVKDLFAAFTGTPGIDRP 

- IVKLSDNPGRPLVLHGVQNIFHTPERL AAWPDPTDQRTRMVLITKDLPEAFVKDLFAAFTGTPGIDRP 

- IVSVADNPERPWLHGVQTVFHAPERL AAWPDPADRRTRMVLITKGLDEAFVRDLFDAFTGKPRVDRP 

- IVQIAEDPDRPWIHGVQKIFHPPARL PQWPQ_GKRETLLVLIVKDLPEAYVRELFDAFLGRPGLDRP 

- VIELSEDPSRPLVIHGVQKILHPPARL PAWPD_GQRGTRLVL I TLDMPEDYVRRL FAAFTNRP S I DT P 

- IIDIGSD LLVSINGVMHVIYPP 

- IIDIGSD LLVSINGVMHVIYPP 

- IIDIGSG FLVSINGVMHVIYPP 

- IIDIGSG FLVSINGVMHVIYPP 

- VLYM_EGVDRKWFQGVH QLMGSDVGGKWDG_ETPGNRMVFIGVDLPRDTI 

- VLNI_ADEPRRLVFQGVL RLYGFDWDSEWRDDEARESVIVFIGDNLPEDSIREGF 

- MLWI_DGEPNRLLFQGVQ RL Y SADWDRPW_GDET PHS TLVF IG IQLPEDE I RAAF 

- MLWI_DGEPNRLLFQGVQ RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

- MLWI_DGEPNRLLFQGVQ RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

- MLWI_DGEPNRLLFQGVQ RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

- MLWI_DGEPNRLLFQGVQ RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

- MLWI_DGEPNRLLFQGVQ RLYSADWDRPW_GDETPHSTLVFIGIQLPEDEIRAAF 

★ ..  •••  ^ ...... 

IIsI . SSSss . Iiliills . iisissii . iaiSSSiSSSSSSsisiiiiSSSiSSS . issiisii . ssisias . 


(c) 




Subfamily  4 

10  20  30  40  50  60  70 

I I I I I I I 

d - MSTVDHAPQTSEAMAIPKQGLPVTIITGFLGSGKTTLLNYILSNQQGLKTAVLVNEFGEIGIDNELVI 

- MSTVDHAPQTSEAMAIPKQGLPVTIITGFLGSGKTTLLNYILSNQQGLKTAVLVNEFGEIGIDNELVI 

SQPMDAPKQGMPVTIITGFLGSGKTTLLNHILSNQQGLKTAVLVNEFGEIGIDNELIV 

IPKRGMPVTIITGFLGSGKTTLLNQILKNKHDLKVAVLVNEFGDINIDSQLLV 

PVTVL  SGYLG  SGKTTLLNH I LQNREGRRI AVI VNDMS EVN I DKDLVADGG 

PVTVL  SGYLG SGKTTLLNH I LQNREGRRI AVI VNDMS EVN I DKDLVADGG 

PVTVLSGYLGAGKTTLLNS ILQNREGLKIAVIVNDMSEVNIDAGLVKQEG 

PVTVLSGYLGAGKTTVLNHVLANKEGMKVAVIVNDMSEVNIDATLVKQ_G__ 

LPVTVLSGFLGAGKTTLLNEILRNREGRRVAVIVNDMSEINIDSAEVERE 

L PVTVL SGFLGAGKTTLLNEILRNREGRRVAVIVNDMSEINIDSAEVERE 

LPVTVLSGFLGAGKTTLLNEILRNREGRRVAVIVNDMSEINIDSAEVERE 

LPVTVLSGFLGAGKSTLLNHVLKNRENRRVAVIVNDMSEINIDGSEVQRD 

lpvtvlsgflgagkttllnailrnrqglrvavivndmsevnldaesvqrd 

lpvtvlsgflgagkttllnhilanrdglrvalivndmseinidaalv_rdg__ 

lpvtvlsgflgagkttllnhilanrdglrvalivndmseinidaalv_rdg 

L PVTVL SGFLGAGKTTLLNHVLNNRENRRVAVIVNDMSEINVDAALI_REG 

lpvtvfsgflgagkttllnrllnnrdglrvavivndmsevnidaqlv_rdg 

LPVTVFSGFLGAGKTTLLNRLLNNRDGLRVAVIVNDMSEVNIDAQLV_RDG__ 

lsgflgagkttllnhvlanregkrvavivndlsdvnidaqlv_gdgtat 

LSGFLGAGKTTLLNHILANRAGMKVAVIVNDLAAANVDATFV_RGAT_ 

* * * **.*.**..*.* * ' . ** * ' ' 

iaaiaai  . aaaSiiSi  .as.i.iaili.II.i  . AiAIIASIIsASSSssIIIIIAsI . sIsIAssilSSS . a . a 


n 

m 

1 

P 

i 

h 

g 

f 

e 

j 

k 


80  90  100  110  120  130  140 


ardrnmvet.sngcvcctinedlveavykvlereokid 

YLWETTGLAD  PL  PVA 

ASDKNMVET  iSNGCVCCTTNEDLVEAVYKVLEREOKID 

YLWETTGLADPLPVA 

STDENMVELSNGCICCTINNDLVDAVYKVLEREEKLD 

YLWETTGLADPLPVA 

SVDQDMLELSNGC ICCT INDGLVDAVYRVLEREERID 

YLVIETTGVADPLPI I 

GT.qRT  DF.KT ,VF.T .SNGCTCCTI.RDDLLKEVER  LVKKGGID 

QIVIESTGISEPVPVA 

rtT.SRT  F1F.KT ,VRT ,SNGC T GGTT.RDDLLKEVER  LVKKGGID 

QIVIESTGISEPVPVA 

GLSRT  DEKLVEMSNGC ICCTLREDLL I EVER  LAKDGRFD 

YIVIESTGISEPIPVA 

GFTRT  DEKLVELQNGC ICCTLREDLI I EVEK  LARLGNID 

YILIESTGISEPIPVA 

TST.SRRFFEKT.VRMTNGGTGSTLR  dllseisa  larsgrfd 

YLLIESSGISEPLPVG 

ISLSRSEEEKLVEMTNGCICSTLR  DLLSEISA  LARSGRFD 

YLLIESSGISEPLPVG 

TST.SR.qRK  KT .VF.MTNGC T CCTLREDLLSE I SA  LAAAGRFD 

YLL I ES  SGI S EPLPVA 

vqT.WRAF.F.  RT.VF.MSMGCICCTLREDLLEEVGR  LAREGRFD 

YLLIESTGISEPMPVA 

VSLHRGRDE  LIEMSNGCICCTLRADLLEQISD  LARQQRFD 

YLLIESTGISEPMPVA 

GAMT,.qRT  F.FQT .VFMTNGCICCTLRDDLLTEVRA  LAEOGRFD 

YLLIESTGIAEPLPVA 

GANLSRT  EEQLVEMTNGC ICCTLRDDLLTEVRA  LAEQGRFD 

YLLIESTGIAEPLPVA 

GADLSRT  EEQLVEMTNGC ICCTLRDDLLKEVSQ  LAAQGRFD 

YLLIESTGISEPLPVA 

GAAT..qRT  DKTT .VF.FSNGCTCCTLRDDLSKEVKR  LALEORYD 

YLLIESTGIGEPMPVA 

GAALSRT  DETLVEFSNGCICCTLRDDLSKEVKR  LALEQRYD 

YLLIESTGIGEPMPVA 

qADT.qRT  nF.RMVF.T .SNGCTCCTLREDLLDEIAR  LAREDRFD 

YLLIESTGVAEPLPIA 

elshv_eahlvemsngcicctlrddllveirr_laaenrfdairgridghspsrcry A 

..*..***.*.*....* * * 

. isissiSSSsIIAIsA. AIAiAISSSIiSilsSiliSSSsIA. . a . a . a . aa . aaaaliiiaii . iis . i . ii 




150  160  170  180  190  200  210  220 


d - LTF LGTELRDMTRLDS I ITM VDCENFSLDLFNSQAAQS 

C - LTF LGTELRDMTRLDS  I ITM VDCENFSLDLFNSQAAQS 

b - LTF LGTELRDLTRLDS I ITV VDAANYSLDLFNSQAAYS 

a - LTF LGTELRDLTNLDS ILTV VDAEAFEPNHFESEAALK 

s - QTFSYIDDELGIDLTAICRLDTMVTV ,VDANRF___VHD INS EDLLMDRDQ SVDETDERS IADL 

r - QTFSYIDDELGIDLTAICRLDTMVTV VDANRF  VHDINSEDLLMDRDO SVDETDERS T APT . 

t - QTFSYIDEEMGIDLTKFCQLDTMVTV VDANRF  WHDYOSGESLLDRKEALGEKDEREIADT , 

q - QTFTYLDEELGIHLGDRCTLDTMVTV VDANRF  WEDYESGESLLDRKQTDDEADTREWDL 

0 - ETFTFIDTD_GHALSSPDSTPWSPLV DANSF LRDY TAGGRVEADAPEDERDIADL 

n - ETFTFIDTD_GHALSSPDSTPWSPLV DANSF LRDY TAGGRVEADAPEDERDIADL 

m - ETFTFIDTD_GHALADVARLDTMVTV VDGNSF  LRDY  TAGGRVEADAPEDERDIADL 

1 - ETFTFRDEL_DRSLSDVARLDTMVTV VDGMNF  LRDYOEAASLASRGETLGEEDORSTTDT, 

P - ETFAFLDTE_GFSLSELARLDTLVTV VDGSQF QALLESTDTVARADTEAHTSTRHLADL 

i - ATFDFRDEN_GRSLSDVAQLDTMVTV VDAANL LNDYGSSDFLADRGETAGDGDNRTIVDL 

h - ATFDFRDEN_GRSLSDVAQLDTMVTV VDAANL  LNDYGS SDFLADRGETAGDGDNRTIVDT . 

g - ATFEFRDER_GKSLSDGARLDTMVTV VDAANF LKDYASGDFLRDRGESLGEEDERTLVDL 

f - ATFAVRDAD_GFCLSDVARLDTMVTV VDGS Q F___TALF S SMNTLADLGQQ AD S DDTRSLADL 

e - ATFAVRDAD_GFCLSDVARLDTMVTV VDGS  Q F___TALF  S SMNTLADLGQQ  AD  S DDTRSLADL 

j - ETFTFEDAE_GRLLSDLARLDTMVTV VDAOHF  LADYDAADFLADRGOARDEDDDRTWDT, 

k - ETFTFVDDD_GSTIADVARLDTMVTV IDAFNF LHDYARDDALAEHGLAATDEDDRTLVEL 

★ ★ ★ 

iAIsisaSsiSSSI . siiSiaiiiaiaa . aaa . ia . ilAiSSIsis . ssisiSSSiSSSSSSSSSSSSasiisi 


230  240  250  260  270  280  290 


d - QITYGDIIVLNKTDLVEEADVDSLEVRLRDLKTGARILRTQK 

C - QITYGDIIVLNKTDLVEEADVDSLEVRLRDLKTGARILRTQK 

b - QIAYGDVILLNKTDLVDEASLNDLERKINEVKEGARILRTKR. 

a - QLTYADIILLNKTDLATSEKIQALEDYIQTVKDSARILHTKY. 

s - LIDQVEFCDVLIINKIDLISEEELAKLEKVLSALQPIAKIIKTTN. 
r - LIDQVEFCDVLIINKIDLISEEELAKLEKVLSALQPIAKIIKTTN 
t - LIDQIEFCDVLILNKCDLVSEQELEQLENVLRKLQPRARFIRSVK 
q - L I DQ I E FANV I LLNKVDLVEEDDVRELTAVLHKLNP EAE I L PV S H. 

0 - LVDQIEFADVILVSKADLISAPAPGRIDGVLRSLNEPLRLFDDGL, 
n - LVDQIEFADVILVSKADLISAPAPGRIDGVLRSLNEPLRLFDDGL. 
m - LVDQIEFADVILVSKADLISHQHLVELTAVLRSLNATAAIVPMTL. 

1 - LIDQVEFADVILVSKIDLISSEQREELLAILRQLNAEAEILPMVM. 
P - LIEQVEYANVILVNKRDLIDEPGYQAVHAILAGLNPSARIMPMAH. 
i - LVEQIEFADVWLNKIGTATPEERDAARKIIVGLNPDAKLIEADF, 
h - LVEQIEFADVWLNKIGTATPEERDAARKIIVGLNPDAKLIEADF. 
g - LVEQIEFANVWLNKISMVSKEECALARKVIRSLNPDARIEETDF. 
f - LGQQ I EFADTI VI SKCDLIDTTQLARVRAWRGLNRDAQ 1 1 EAHL. 
e - LGQQ  I EFADTIVI  SKCDLIDTTQLARVRAWRGLNRDAQ  1 1 EAHL. 
j - LIEQVEFCDVIVLNKTDLVDEAERERLAGILHSLNPRARILPASF. 
k - LIEQIEFCDVLVINKADLVDADALARLQRILANLNPRAQQIVSRF. 


AEVPLPLILSVG 
.AEVPLPLILSVG 
.SQVPLPLILSVG 
GEVALPLILGVG 
SEVDLKEVLNTQ 
.SEVDLKEVLNTQ 
GNVKPQEILHTG 
GRVDLNRILHTG 
.GRIPLDTILDTG 
.GRIPLDTILDTG 
GRIPLDTILDTG 
GQVPLAKILDTG 
GNVALSSLLDTH 
.GKVELK  EVLGTG 
GKVELKEVLGTG 
GQVPLNTILDTR 
.GEVPLPQLVGTG 
.GEVPLPQLVGTG 
GHVPLDAILNTH 
GDVPLAEVINTG 


iisAISIisIIIIsA. sllSSSSSSSiss . i . ssissiiSsiSSSSSiSS . sS . iaiiSSSSiSISiSsIISIs 




300  310  320  330  340  350  360  370 


d - LFESDK 

C - LFESDK 

b - LFESDK 

a - LTPKDD 

S - RFDFEKASESAGWIKELESGGHASHTP 

r - RFDFEKASESAGWIKELESGGHASHTP 

t - LFNFEEASGSAGWIQEL_TAGHAEHTP 

- LFDFDKASQGAGWIKEL NEEHTP 

- T.FST.F.KAAOAPGWLOELO  G EHTPETEEVRWCTASTSHPSTPNVCMTSSAASGPTESYSGPRRSTGMTRAG 

- LFSLEKAAQAPGWLQELQ_G_EHTPETEEVRWCTASTSHPSTPNVCMTSSAASGPTESYSGPRRSTGMTRAG 

- T.FRT.F.KAAOAPGWLOELO  G EHTPETEE 

- RFDFDRAASAPGWLKELR G EHLPETEE 

- T.FPT.PET.AASPGWMRKME  A TDTPASE  

- RFDFARAEEHPLWFKELH GFKDHIP_ 

- RFDFARAEEHPLWFKELH GFKDHIP 

- LFDFNKAHEHPLWYKELY GFNQHVP 

- RFDFTRAQLAPGWMRELR_G_EHTP 

- RFPFTR  APT APGWMRELR  G EHTP 

- RFPFPAAAAAPAWLAELR  G EHV . 

- RFDFDAAANAPGWLASL EHRRDADEAE_ CGQGQSQGDGRVH 

slsisSissiiiiississ . . . sSiSSS . saisiaaiaaaa . aa. aiaiaaaila. .aaai. .SSSa. . . .ais 

380  390  400  410  420  430  440 

I I I I I I I 

- _ETEEYGISSFVYKRRLPFHAKRFNDWLES MSNN  WRSKGIVWLAOYNHVACLLSQAGSSCNIHPVTYW_ 

- _ETEEYGISSFVYKRRLPFHAKRFNDWLES MSNN_WRSKGIVWLAQYNHVACLLSQAGSSCNIHPVTYW_ 

- _ETEEYGI S SFVYKRRLPFHSTRF YRWLDQ MPKN  WRAKGIVWCASHNNLALLMSQAGPSVTIEPVSYW_ 

- _ETEEYGISSFVYKRQRPFHPERWMEWLEN WPOE  WRAKGFFWLATRUEMAGLISQAGTSIVIQGAGLW_ 

- SPRSDY F QAGHLI RHGYV 

- SPRSDY F QAGHLIRHGYV 

_ YGISSWYRERAPFHPQRLHDFLSSEW TNGKLLRAKGYYWNAGRFTEIGSISQAGHLIRHGYVGRWW 

- YGIASMAYRARRPFHPARFHAFLHNPL KQGRLLRSKGFFWLASRPREAGSWSQAGGLMRYGLAGRWW 

- __SDTYGVTSWVYRERAPFHPQRLLEFLQKPW__HNGRLLRSKGYFWLASRHLEIGLLAQSGKQFQWDYVGRWW 

- _ETEEYGIRSFVYRAKKPFDPVLFQRFIDRAW^_PGWRAKGFFWLATRPRYVGEI  SQAGALVRTGKMGLWW 

- _ETEEYGIRSFVYRAKKPFDPVLFQRFIDRAW PGWRAKGFFWLATRPRYVGEISQAGALVRTGKMGLWW 

- _ETEEYGARSFVYRARKPFDPARFQAFIDQNW PGWRSKGFFWLATRPDFVGEISQAGALVRTSKRGRWW 

- _ETEEYGITSFVYRARRPFHPLRFNHLLAHGL SGVIRSKGFFWLASRMDWVGELSSVGSATRTQAAGFWY 

- _ETEEYGITSFVYRARRPFHPLRFNHLLAHGL SGVIRSKGFFWLASRMDWVGELSSVGSATRTQAAGFWY 

- AETEALGIRSFVYRRRRPFHPQRLWDLMHTEWLREHGRVLRSKGYFWLAPRMDVAGSWGQAGGVMRHGGAGAWW 

- SEADEYGIGHFVYRARRPFHPQRLWALLHEEW KGVLRSKGFFWLATRNDIAGSLSQAGGVCRHGPAGHWW 


. s . ssi . issiiiSSSs . i . isslsSiiSSSiiSSSssiiaia. iiliiSSSsiii . i . . i . siisis . SSSii 


450  460  470  480  490  500  510 


3 

f 

e 

j 

k 


VASMSEAOOT 

Q I LAERQDVAAE 

VASMSF.AOOT 

QILAERQDVAAE 

VAALPKLEOE 

QVKQQEPEILEE 

VAAYDKEEOO 

QVLAEEPELAAR 

EPLVEV 

LPRDEWP 

ADDYRRDGI LDK 

EPLVEV 

LPRDEWP 

ADDYRRDGILDK 

KF 

LPRDEWP 

ADDYRRDGI LDK 

RE 

VPTGOWP 

QDEEGLRAIMRH 

NFTF.P 

SOWP 

RDEYRLQGIMAK 

SA 

VPREOWP 

REKOFLD 

MME 

SA 

VPREOWP 

REKOFLD 

MME 

SA 

VPKHYWP 

AEPEWOR 

AMQ 

AARHRVDVHEENMPPQIGIAPLPYTELGWQRQQNDCWT APLPAPHEVSDPGEYVAMQ 

AARHRVDVHEENMPPQIGIAPLPYTELGWQRQQNDCWT APLPAPHEVSDPGEYVAMQ 

AAVDTAEWPDE DEARTD_IADKMLDNGAPAPW 

AAQDRTEWP EAGDE  LYDEIV  ADWHGELADT 

s. . a ■ iiiSSS . siSSSsi . . ai . ii ■ iSSSSSSissa ■ ss . SSS ■ iissiiaa. i . isis . .SSSSSSiis.s 




520  530  540  550  560  570  580  590 

•I I I I I I I I • • 

S - WDPF.YGDRHTOFVIIGTE  LDEEKLTKELDACLVNAOEID  ADWOOFEDPY 

r - WDPEYGDRHTOFVIIGTE  LDEEKLTKELDACLVNAOEID ADWQQFEDPY 

t - WDPEFGDRLTOLVFIGTD  LDEETITKELDOCLLTEYEFD SDWSLFEDPF 

q - WHRKYGDRMNELVFIGLD  MDRKEIEKSLDTCLLTDEELG QDWSALPDP 

0 - WEEPVGDCRQELVFIG_QAIDPSLLHRELDACLLTTAEIELGP 

n - __WEEPVGDCRQELVFIG_QAIDPSLLHRELDACLLTTAEIELGP 

m - ,WEEPVGDCRQELVFIG_QAIDPSLLHRELDACLLTTAEIELGPDVWT 

1 - WEEGVGDCRQELVFIG_QNLDFAALRAALDACLLDDAEMNLG 

P - _WDSWGDCRQELVFIG_QGLDTRVLQRELDHCLLSAQEIAAGPLAW 

i - PYLDPTFGDRRQEIVFIGSDPMSEARIRAQLDACLIDTEAF TPDAWRHLPDPFANWD 

h - PYLDPTFGDRRQEIVFIGSDPMSEARIRAQLDACLIDTEAF TPDAWRHLPDPFANWD 

g - PYFHEVWGDRRQEIVFIGIDPMRQEAIIAELDKCLVQEECF APECWSGLSDPFPNW 

f - RLWHPQWGDRRQELAVIGLD_MDKDVARNELDACLLSNEELRMGPMYWSALPDPFPQWR 

e - RLWHPQWGDRRQELAVIGLD_MDKDVARNELDACLLSNEELRMGPMYWSALPDPFPQWR 

j - GDRRQELVFIGID_LDEAALAASLDACLLTDAEFAAGPAAWAELPDPFPDWGFDD_NHGEAED DH 

k - SIGDRRQELVLIGIG_LDAAAWRAKFDACLLTGAEYAQGKQAWAGYADPFPAWDVDDHDHDHAHDHHDH 

siiSSSi . aisisiiii . . s . iSSSsiSSSiasaiiSSS . isiSSSiissisa. iisisiaaasassisaaaaa 

(d) 

Subfamily  5 

10  20  30  40  50  60  70 


g _ KLPVTIVTGFLGAGKTTLLRHMLDNAEGRRIAVIVNEFGELGIDGEILKQCSI_GCSEEEAQG 

i - RVPCTIVTGFLGAGKTTLLRGLLEKLDGKRLAIIVNEFGDIGIDGEILKGCGIESCPEEN 

h - RVPCTIVTGFLGAGKTTLLRGLLEKLDGKRLAIIVNEFGDIGIDGEILKGCGIESCPEEN 

j _ RVPCTWTGFLGAGKTTLVRHLLENAGGKRIAIIVNEFGDIGIDGEILKGCGIDTCPEEN 

k - KIPVTVITGFLGAGKTTLIRHLMANPEGRKLAVLVNEFGTVGVDGEILRQCADENCPDEN 

m - MTLARAPQRKI PATVITGFLGAGKTTMIRNLLQNADGKRIALI INEFGDLGVDGDVLKGCGAEACSEED 

0 - MTTARANQGKIPATVITGFLGAGKTTMIRNLLQNADGKRIGLIINEFGDLGVDGDVLKGCGAEACTEDD 

n - MTTARANQGKI PATVITGFLGAGKTTMIRNLLQNADGKRIGLI INEFGDLGVDGDVLKGCGAEACTEDD 

1 - KIPATVITGFLGSGKTTMIRNLLENANGKRIALIINEFGDLGVDGGILKRCGIETCREED 

f - --KIPVTWTGFLGSGKTTLVRNLLQNNQGRRIAVLVNEFGEVGIDGDILRSCGV CDED GNQ 

e - --KIPVTWTGFLGSGKTTLVRNLLQNNQGRRIAVLVNEFGEVGIDGDILRSCGV CDED  GNQ 

d - — KIPVTVITGFLGSGKTSLIRHLLQNNAGRRIAVLVNEFGELGIDGDLLKSCQV CPEDEDGGSN 

b - RSTPOOIPWVLAGFLGSGKTTLLNHLLHRSGGSRIGAWNDFGSIEIDA  MAVAG ALGDST 

a - RSTPQQIPVWLAGFLGSGKTTLLNHLLHRSGGSRIGAWNDFGSIEIDA_MAVAG ALGDST 

c _ QRIPVTVFTGFLGAGKTTLLSGLIRDNKQRRLAILVNEFGEVSIDGALLRSDG ERGGAE 

★ *******  * * * * 

iaSiaiS . Ssl . IIIII . II . i . AA. IIssIISSSSSSslilllAsI . sIsIAiSilsSiSSSsaSSSSSSSSS 
80  90  100  110  120  130  140 


g - _RVFELANGCLCCTVQEEFFPVMRELVAR_RGDLDQILIETSGLALPKPLVQAFQWPEIRNACTVD

i - IVELANGCICCTVADDFQPAIEQILSR_QPKVEHILIETSGLALPKPLVQAFQWPAIKSRVTVD

h - _IVELANGCICCTVADDFQPAIEQILSR_QPKVEHILIETSGLALPKPLVQAFQWPAIKSRVTVD

j - _IVELANGCICCTVADDFVPALDQILSL_NPKVDHILIETSGLALPKPLVQAFQWPTVKSRVTVD

k - IVELANGCICCTVADEFIPTIEALMAR_PVRPDHILIETSGLALPKPLLKAFDWPAIRSKITVD

m - IIELTNGCICCTVADDFIPTMTKLLER_ENRPDHIVIETSGLALPQPLVAAFNWPDIRSEVTVD

o - _IIELTNGCICCTVADDFIPTMTKLLER_ENRPDHIIIETSGLALPQPLIAAFNWPDIRSEVTVD

n - IIELTNGCICCTVADDFIPTMTKLLER_ENRPDHIIIETSGLALPQPLIAAFNWPDIRSEVTVD

l - VIELNNGCICCTVADDFIPTMTKLLDR_EDRPDHIVIETSGLALPQPLVAAFNWPEIKTQVTVD

f - IFPASNEAPKIVELANGCLCCTVQEEFLPTMQALLER_REEIDCIWETSGLALPKPLIQAFRWPEIRTGATVD

e - IFPASNEAPKIVELANGCLCCTVQEEFLPTMQALLER_REEIDCIWETSGLALPKPLIQAFRWPEIRTGATVD

b - V SLGNGCLCCAVDASELDGYLARLARPEAGIDVIVIEASGLAEPOELVRMLLASE  OPGIVYG

a - V SLGNGCLCCAVDAS  ELDGYLARLARPEAGIDVIVIEASGLAEPOELVRMLLAS  EOPGIVYG

c - VH DLSNGLIAYDDDADFLPTMQALWQR  RGTIDHVLIETSGLALPTAVMESLQSEALAPYFVLD

* * * ****** 

i . . iaaai . siisIsA. Illliii . silsilssilSs . SSSisIIIIAIA. Hi. ssIIsilsisSiSSSIIIs 


Figure  5-14.  (Continued) 




150  160  170  180  190  200  210  220 

■I I I I I I I I-- 

g - AVITWDSPAVAAGTFAAHPEQVDQQRRQDPNLDHESPLHELFEDQLASADLVILNKADQLDAEALARVRAEIA 
i - AWAWDGAALAEGQVAHDMEALAAQRANDEALDHDDPVEEVFEDQVACADLIVLTKADLLDDAGLEKAKAHIL 
h - AWAWDGAALAEGQVAHDMEALAAQRANDEALDHDDPVEEVFEDQVACADLIVLTKADLLDDAGLEKAKAHIL 
j - GVIAWDGPALAEGRVANDMDALQAQRAGDDSLDHDDPVEELFEDQIACADLIILSKSDLMDAAGSARANAIIG 
k - GVIAVADAEAVAAGRFAPDVAAVDAQRQADDIIDHETPLSEVFEDQIACADIVLLSKADLAGAEGLATARALIE 
m - GWTWDSAAVAAGRFADDHDKVDALRVGDDNLDHESPLEELFEDQLTAADLIVLNKTDLIDAAGLKSVREEVA 

o - GWTWDSAAVAAGRFADDHDKVDALRVEDDNLDHESPIEELFEDQLTAADLIVLNKTDLIDASGLKAVRDEVS
n - GWTWDSAAVAAGRFADDHDKVDALRVEDDNLDHESPIEELFEDQLTAADLIVLNKTDLIDASGLKAVRDEVS

l - GWTVIDAAAVAEGRFADDHDKVDAQRAADESLDHESPLEELFEDQIHAADLIVLNKADLIDAAKLDSVKADVA
f - GVITWDGDALARGAMVGDLEALEAQRQADDSLDHETPIEELFEDQLACADMVLLTKTDLLGDGDQQRLENWLG 
e - GVITWDGDALARGAMVGDLEALEAQRQADDSLDHETPIEELFEDQLACADMVLLTKTDLLGDGDQQRLENWLG 
d - AWTWDCAAVASGTFASDLEAIAIQRQADDSLEHETPLQELFEDQLACADLWLSKTDLVDAATKSQVEELVK 

b - GLVEWD AAF.FD  DTRARHPEI  DRHLALADLVWMKTD  RATDAERVLGLV 

a - GLVEWD AAEFD_DTRARHPEI_DRHLALADLVWNKTD RATDAERVLGLV_ 

c - ATLAWDTPLLLSGGF DRAADDD_ATQTPIASLFEQQLANADIWLNKID 

* * ** * ^ * 

illillAssiiiS. siiss . ssisi ■ a. SSSsisISSSIssIissIIiilAIIIIsAsAiiSSSSSSSiSSSis 

230  240  250  260  270  280  290 

I I I I I I I 

g - GELPAAVKIVEASRGELPLPVLLGLNAEAELHIDGRPTHH D HEGHEDHDHDE F 

i - EHLPKAAKIWASNGAIDPTVLIGLGLAVEEDIENRRTHH DGELD HEHDD F 

h - EHLPKAAKIWASNGAIDPTVLIGLGLAVEEDIENRRTHH DGELD HEHDD F 

j - EHSARAVKIVSTSHGKVDPSVLLGLGLAVEDDIENRKSHH DGAFD  HEHDD F 

k - AELPRKLPILPLTEGVIDPKVILGLGAAAEDDLAARPSHH DDHDD HEHDD F 

m - SRI  SRKPTMIEARNGEVAAAILLGLGVGTEGDIVNRKSHH EMEHEAGEEHDHDE F

o - SRTSRKPTMIEAKNGEVAAAILLGLGVGTESDIANRKSHH EMEHEAGEEHDHDE F

n - SRTSRKPTMIEAKNGEVAAAILLGLGVGTESDIANRKSHH EMEHEAGEEHDHDE F

l - ERSSRRVNMVPASFGKLGADVLLGLGVGTEDDIVNRRSHH EAHHGDGHEHDHDE F

f - DQLPTGVKWPCHQGQISPEILLGFNAAVEDNLDSRPSHH DHE EDHDHDEDINAVYVALDQEF

e - DQLPTGVKWPCHQGQISPEILLGFNAAVEDNLDSRPSHH DHE EDHDHDEDINAVYVALDQEF 

d - QELPRWKMVESDRGQLDPSILLGFQAAVEDNLDSRPSHH DTE EDHDHDDDITSTHLILDRDF 

b - HSLVGGAAWPATYGRIDPEFLY DCRPGEERVGQLSFDDLHDHSEGGAHADHLHAAYDTL 

a - HSLVGGAAWPATYGRIDPEFLY DCRPGEERVGQLSFDDLHDHSEGGAHADHLHAAYDTL 

c _ 


SSSiSSisiiS . SS . SisiSiii . isisiaSsissaSsiiai . aiaiSSSSSSSSSisa. ssisiss . . iassi 
300  310  320  330  340  350  360  370 


g - DSFHVDLPEVEE_AALLEALGELVERHDILRIKGFAAIP GKPMRLLVQGVGKRFDRHFDRKWLAD 

i - DSFVIDLPAVTDPEALAARIAQTAAAENVLRIKGFIEVA_TKPMRLQVQAVGSRVNHYYDRPWAAG 

h - DSFVIDLPAVTDPEALAARIAQTAAAENVLRIKGFIEVA__TKPMRLQVQAVGSRVNHYYDRPWAAG 

j - DTFIVDIPSIANPDELAQRVAAAAEQENVLRVKGFVEVG GKPMRLLLQAVGPRVNHYYDRAWTAE 

k - DTWIELPEIADPAALVAAIERLAREQNILRVKGHIAVA_GKPMRLLVQAVGERVRHQYDRPWGT 

m - DSFWELGAIADPAAFTERLKGVISEHDVLRLKGFVDVP GKPMRLLVQAVGSRIDQYFDRAWASG 

o - DSFWELGSIADPAAFIDRLKGVIAEHDVLRLKGFADVP GKPMRLLIQAVGARIDQ YYDRAWGAG

n - DSFWELGSIADPAAFIDRLKGVIAEHDVLRLKGFADVP GKPMRLLIQAVGARIDQ YYDRAWGAG

l - ESFWEAGSVADPKAFAEKLKAVIAEHDILRLKGFVDVP GKPMRLWQAVGPRIEHYFDRPWGKD

f - D PAHLVQGLTDLVQDHEIYRIKGFVAVP KKAMRLVLQGVGQRFDYFYDRLWTEE

e - D PAHLVQGLTDLVQDHEIYRIKGFVAVP KKAMRLVLQGVGQRFDYFYDRLWTEE 

d - D PKKTjOOOLOTLTNOQEIYRIKGFVAVP  NKPMRLVMOGVGNRFDKFYDRPWOPO 

b - _SFVSGLP LDPRRLMRFLDS RPKGLYRIKGYVDFGPYDTRNRYAVHAVG_RFLRFYPEPWTPAGAAGGSG 

a - _SFVSGLP LDPRRLMRFLDS RPKGLYRIKGYVDFGPYDTRNRYAVHAVG_RFLRFYPEPWTPAGAAGGSG 

c - 


Ssiiisi . siss . SSiSSSissiiSSSsiiaia. iisii . iSSSiaiiiiii . saiSsiiSSSiSSS . ii . . a . 


Figure  5-14.  (Continued) 




380  390  400 


g - EARSTRLWIG_QELDQAAI 

i - EERRSRLWIG 

h - EERRSRLWIG 

j - DDRRSRLWIG 

k - EARRSALWIA 

m - QTRSTRLWIGLHDMDEPAVRAAISALV 

o - EKRGTRLWIGLHDMDEAAVRAAITALV

n - EKRGTRLWIGLHDMDEAAVRAAITALV

l - ETRSTRLWIGLHDIDEAAI

f - DSRQTRLWIG_QGLESEKIKMAIA — 

e - DSRQTRLWIG_QGLESEKIKMAIA-  - 

d - EARQTRLVFIG_RDLNSTEI 

b - APETGRTQLVLIG 

a - APETGRTQLVLIG 

c 


i . SSSs . siiii . issiSSSsisiiiSiii 


(e) 


Figure  5-14.  (Continued) 

The target is listed as sequence “P” in Subfamily 3. The Darwin server provided the strong and weak surface assignments (S, s), the strong and weak interior assignments (I, i), and the strong and weak "active site" assignments (A, a).

We started at the beginning of each of the subfamily multiple sequence alignments. In each case, a stretch of 10 positions contained an interior assignment, a conserved P (at the second position), or a conserved G (at the eighth position). Representative sequences and their respective positions are shown in Table 5-6 for each of the subfamilies.


Table 5-6. Segment 1: positions 043-052 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           MKVMIIGGFL    007-016     f
2           IPVTIVTGYL    042-051     f
3           VPVTLLTGFL    043-052     f
4           LPVTVLSGFL    021-030     n
5           LPVTIVTGFL    011-020     g



These are all assigned as strands. The GxGKTT unit that follows (Table 5-7) not only anchors the five subfamily alignments but is also a parse. GxG is well known as a marker of a beta-turn-beta motif in the Rossmann fold.


Table 5-7. Segment 2: positions 053-058 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           GSGKTT        017-022     APC
2           GAGKTT        052-057     f
3           GAGKTT        053-058     f
4           GAGKTT        031-036     n
5           GAGKTT        021-026     g
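
As an illustration of how an anchor such as GxGKTT can be located in a sequence, the following minimal sketch scans a fragment with a regular expression. The helper name `find_anchor` and the pattern constant are hypothetical and are not part of the Darwin server; positions are counted from the start of the fragment supplied, not in the subfamily alignment numbering used in Table 5-7.

```python
import re

# Hypothetical pattern: GxGKTT, where "x" stands for any single residue.
GXGKTT = re.compile(r"G.GKTT")


def find_anchor(sequence, pattern=GXGKTT):
    """Return (start, end, match) tuples with 1-based, inclusive positions."""
    hits = []
    for m in pattern.finditer(sequence):
        # m.start() is 0-based; m.end() is exclusive, so it already equals
        # the 1-based inclusive end position.
        hits.append((m.start() + 1, m.end(), m.group()))
    return hits


if __name__ == "__main__":
    # Fragment taken from the start of the Subfamily 3 representative (sequence f).
    fragment = "VPVTLLTGFLGAGKTTLLNELL"
    for start, end, motif in find_anchor(fragment):
        print(start, end, motif)
```

Because the numbering is relative to the fragment, an offset has to be added to recover alignment positions such as 053-058.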

The pattern of interior and surface assignments immediately following the motif suggested a helix. Anchoring the helix is the iissiis pattern seen in (for example) Subfamily 5, positions 027-033. This prompted us to try to fit the patterns for all subfamilies to a helical wheel. The helical wheels are shown in Figure 5-15. For Subfamily 1, a helix was assigned for 023-033, 12 residues, excluding the Ts that are part of the preceding anchor. The presence of the sequence PG at 031-032, however, makes the helix only 9 positions long. The helix in Subfamily 2 is also short, 9 residues; there are, however, many helices that are this short. Subfamily 3 also has a short helix, 9 residues long. Subfamilies 4 and 5 have short helices as well. This consistency across the subfamilies led us to assign a helix to this region.
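
The helical wheel test described above can be sketched numerically. The following is a minimal illustration, not the procedure used by the Darwin server: it places successive positions 100 degrees apart (3.6 residues per turn) and measures how strongly the interior and surface assignments cluster on opposite faces. The scoring weights and the function names `wheel_angles` and `periodic_moment` are assumptions made for this example.

```python
import cmath
import math

# Assumed scoring (not from the source): interior positive, surface negative,
# strong assignments weighted twice as heavily as weak ones.
SCORES = {"I": 2.0, "i": 1.0, "S": -2.0, "s": -1.0}


def wheel_angles(n_positions, step_deg=100.0):
    """Angle in degrees of each position on a helical wheel."""
    return [(k * step_deg) % 360.0 for k in range(n_positions)]


def periodic_moment(assignments, step_deg=100.0):
    """Normalized length of the vector sum of scores at a given angular step.

    A step of 100 degrees tests helical periodicity; a step of 180 degrees
    tests the alternating pattern expected for an amphipathic strand.
    """
    total = 0j
    for k, symbol in enumerate(assignments):
        score = SCORES.get(symbol, 0.0)  # unassigned positions contribute 0
        total += score * cmath.exp(1j * math.radians(k * step_deg))
    return abs(total) / max(len(assignments), 1)


if __name__ == "__main__":
    pattern = "iissiis"
    print("helix-like moment :", round(periodic_moment(pattern, 100.0), 3))
    print("strand-like moment:", round(periodic_moment(pattern, 180.0), 3))
```

For the iissiis pattern the 100-degree moment is clearly larger than the 180-degree moment, which is the numerical counterpart of seeing interior and surface positions fall on opposite arcs of the wheel.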

Following this is clearly a strand, based on five consecutive interior assignments. These are summarized in Table 5-8.




Figure 5-15. Segment 3: helical wheels based on positions 058-066 from Subfamily 3. (a) Subfamily 1, positions 022-033; (b) Subfamily 2, positions 057-065; (c) Subfamily 3, positions 058-066; (d) Subfamily 4, positions 034-044; (e) Subfamily 5, positions 026-034.

Table 5-8. Segment 4: positions 072-076 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           IAIIV         038-042     f
2           IAVIL         071-075     f
3           AAVIV         072-076     f
4           VAVIV         050-054     n
5           IAVIV         040-044     g

As a check, we asked whether this segment could be fit on a helical wheel. We started by making position 76 from Subfamily 2 a surface position. No 3.6-residue amphiphilicity could be seen, as shown in Figure 5-16.

The next region of note is an anchor identified by a pair of conserved polar amino acids: N, E, and in one case D. The subfamilies and respective positions are shown in Table 5-9.




Figure 5-16. Helical wheel showing the lack of amphipathic behavior for Subfamily 3.


Table 5-9. Segment 5: positions 077-078 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           NE            043-044     APC
2           NE            076-077     APC
3           NE            077-078     APC
4           ND            055-056     n
5           NE            045-046     g

This  is  in  principle  an  active  site  segment.  We  generally  have  difficulties  assigning 
secondary  structure  to  active  site  regions.  Therefore,  we  frequently  treat  these  as  parses. 
The  only  question  is  whether  we  bring  the  strand  through  the  NE  anchor.  We  did  not  in 
this  prediction,  on  the  general  principle  that  this  is  the  active  site,  and  that  active  site 
residues  are  usually  in  a coil. 

In the following region, Subfamily 1 (positions 045-050) is conserved. However, this subfamily contained too few proteins, and those proteins showed too little divergence: Subfamily 1 had only six homologs, too few to support a strong evolution-based structure prediction. Subfamily 2 has a gapped region (NExG_E) as well as a parsing string in sequence j (GDSS). This is certainly indicative of a coil. Subfamily 3 displays a variable region with alternating polar and non-polar residues (positions 081-084). Subfamily 4 and Subfamily 5 contain sequences that have alternating polar and non-polar residues. Based on these amphipathic residues and the parsing string in Subfamily 2, we placed a strand, of length one, in this region.
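
Parsing strings such as GDSS can be flagged automatically. The sketch below is a hedged illustration only: it assumes that a parsing string is a run of two or more consecutive residues drawn from P, G, D, N, and S, a set inferred from the examples in this chapter (GDSS, NGSNS, SG, NG, GN, NP) rather than taken from a stated definition. The function name `find_parsing_strings` is hypothetical.

```python
import re

# Assumption for illustration: residues treated as structure breakers.
PARSING_RESIDUES = "PGDNS"


def find_parsing_strings(sequence, min_length=2, residues=PARSING_RESIDUES):
    """Return (start, run) pairs for runs of breaker residues.

    start is the 1-based position of the first residue of the run within
    the sequence supplied. Gap characters end a run because they are not
    in the residue set.
    """
    pattern = re.compile("[%s]{%d,}" % (residues, min_length))
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical fragments built around strings quoted in the text.
    for fragment in ("AAGDSSLL", "LVNGSNSIY"):
        print(fragment, find_parsing_strings(fragment))
```

Raising `min_length` makes the test stricter, which may be useful when short pairs such as NG occur too often to be informative.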

The next anchor is a conserved aspartic acid (D). This residue is conserved throughout all of the subfamilies except Subfamily 2. The positions and representative sequences are shown in Table 5-10.


Table 5-10. Segment 6: position 085 from Subfamily 3

Subfamily   Amino acids   Position   Representative sequence
1           D             051        APC
2           D             085        a
3           D             085        f
4           D             063        n
5           D             053        APC
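
Anchors of this kind are columns in which every sequence of a subfamily carries the same residue. A minimal sketch of that check follows; the function name `conserved_columns` and the set of gap characters are assumptions made for illustration, and the rows are assumed to be already aligned.

```python
# Characters assumed to denote gaps in an alignment row.
GAP_CHARS = "_-. "


def conserved_columns(alignment, gap_chars=GAP_CHARS):
    """Return (column, residue) pairs for fully conserved, ungapped columns.

    Columns are 1-based. Only the columns present in every row are tested.
    """
    if not alignment:
        return []
    width = min(len(row) for row in alignment)
    anchors = []
    for col in range(width):
        column = {row[col] for row in alignment}
        # Conserved means exactly one distinct symbol, and it is not a gap.
        if len(column) == 1 and not (column & set(gap_chars)):
            anchors.append((col + 1, next(iter(column))))
    return anchors


if __name__ == "__main__":
    # The GxGKTT anchor as it appears in different subfamilies (Table 5-7).
    rows = ["GAGKTT", "GSGKTT", "GAGKTT"]
    print(conserved_columns(rows))
```

In this example the second column is reported as variable, which is the "x" of GxGKTT.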

The next region shows very high conservation in Subfamily 1. As a result, and because it contains few sequences, this subfamily is not informative. Subfamily 2 appears to have a 4-position strand followed by a gapped region, which is supported by the parsing residues NGSNS in sequence j. This is followed, in Subfamily 2, by an amphipathic strand: IYEMN, positions 101-105 in sequence n. Subfamily 3 displays a pattern similar to that of Subfamily 2: a short strand, a region rich in parses, and then a possible amphipathic strand. In this segment, Subfamily 4 consists of a variable region, which is supported by gapping. Following this variable region, Subfamily 4 has a potential strand with a conserved glutamic acid (E): LVEM, 086-089 in sequence k. Finally, Subfamily 5 contains a heavily gapped coil, which may be followed by a strand (085-088).
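
The notion of a "gapped region" used above can be made concrete by measuring, column by column, what fraction of the aligned sequences carry a gap. The following sketch is illustrative only; the 0.5 threshold, the gap characters, and the function names `gap_fraction` and `gapped_runs` are assumptions rather than values taken from this work.

```python
# Characters assumed to denote gaps in an alignment row.
GAP_CHARS = "_-"


def gap_fraction(alignment, gap_chars=GAP_CHARS):
    """Fraction of sequences with a gap in each column.

    Shorter rows are padded with gaps on the right so all rows share a width.
    """
    if not alignment:
        return []
    width = max(len(row) for row in alignment)
    padded = [row.ljust(width, gap_chars[0]) for row in alignment]
    return [
        sum(row[col] in gap_chars for row in padded) / len(padded)
        for col in range(width)
    ]


def gapped_runs(fractions, threshold=0.5):
    """1-based inclusive (start, end) runs where the gap fraction >= threshold."""
    runs, start = [], None
    for i, value in enumerate(fractions, start=1):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(fractions)))
    return runs


if __name__ == "__main__":
    # Hypothetical four-sequence alignment with a gapped stretch in the middle.
    rows = ["AC__DE", "AC_GDE", "ACF_DE", "AC__DE"]
    print(gapped_runs(gap_fraction(rows)))
```

Runs reported by `gapped_runs` are candidate coil regions in the sense used in this section; whether a given run is accepted still depends on the surrounding assignments.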

The next region of note is a CICC anchor that aligns all of the subfamilies. We call this a coil; it is most likely a metal binding site. This region is shown in Table 5-11.

Let us examine the helices that follow. Subfamily 1 has a nice amphipathic strand followed by a helix. Subfamily 2 has a helix that is hard to dispute. Subfamily 3 has an annoying gap in this region, and it was necessary to shuffle the alignment in this region to give a helix, so Subfamily 3 is somewhat less convincing. The fourth subfamily looks like a canonical helix. Subfamily 5 has a proline (P) in this region but is otherwise a nice helix, approximately 10 residues long. These helices are shown in Figure 5-17.


Table 5-11. Segment 7: positions 105-108 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           CICC          069-072     APC
2           CICC          108-111     a
3           CICC          105-108     f
4           CICC          093-096     k
5           CICC          092-095     h

Figure 5-17. Segment 8: helical wheels based on positions 113-125 from Subfamily 3.

This  helix  is  followed  by  a coil.  This  coil  is  supported  by  prolines  and  gapped 
regions  in  all  of  the  subfamilies.  Following  this  coil  region  is  a strand  (Table  5-12). 




Table 5-12. Segment 9: positions 136-140 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           IVIIE         093-097     f
2           YILLE         136-140     f
3           RVVIE         136-140     f
4           YLVVE         133-137     c
5           QILIE         118-122     g

What  follows  is  a clear  segment  that  anchors  all  of  the  subfamilies  (Table  5-13). 


Table 5-13. Segment 10: positions 140-143 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           EPTG          097-100     f
2           ETTG          140-143     f
3           ETTG          140-143     f
4           ETTG          137-140     c
5           ETSG          122-125     g

Subfamilies 1 and 2 suggest 4 strands, based primarily on hydrophobic stretches of amino acids. Subfamily 3 also seems to have a set of 4 strands. Subfamilies 4 and 5 potentially have 3 strands. The argument is about the first short strand. Subfamilies 4 and 5 display interior and surface patterns that would suggest a helix. To test this, let us consider Subfamily 3, the subfamily containing the target sequence. Figure 5-18 shows a lack of any 3.6-residue periodicity.




Figure 5-18. Helical wheel showing the lack of amphipathic behavior for Subfamily 3.




The  next  serious  anchor  is  an  aspartic  acid  (D)  which  is  conserved  throughout  the 
subfamilies  (Table  5-14). 


Table 5-14. Segment 11: position 183 from Subfamily 3

Subfamily   Amino acids   Position   Representative sequence
1           D             130        APC
2           D             178        APC
3           D             183        APC
4           D             187        APC
5           D             155        APC

Subfamily 1 has one gap in this region (position 138, sequence c), with a helix to follow. Subfamily 2 has many parsing strings, with a helix that must be forced. Subfamily 3 has 2 gaps in this region. There could be a helix right after the D, but it would need to go through SG, NG, and GN. This region is not very amphiphilic; one key site would be an A/G. As a result of these parsing strings, we conclude that Subfamily 3 is nothing other than a coil in this region. Subfamily 4 has many gaps and parsing strings, but a helix is possible here as well. In Subfamily 5, a helix can be put through the Q.


Figure  5-19.  Segment  12:  helical  wheels  based  on  positions  138-148  from  Subfamily  1. 




This is a high-risk prediction. The next serious anchor is a Q, which is conserved at all positions in all of the subfamilies except Subfamily 5 (Table 5-15).


Table 5-15. Segment 13: position 200 from Subfamily 3

Subfamily   Amino acids   Position   Representative sequence
1           Q             148        APC
2           Q             212        APC
3           Q             200        APC
4           Q             226        APC
5           Q             194        g

Subfamily 1 was identified as having a single strand up to NKVD. Subfamily 2 appears to have 2 strands between Q and NKVD. Subfamilies 3 and 4 appear to have 2 strands between Q and NKVD as well. Subfamily 5 also has a single strand up to the anchor. However, it is possible that this region could be an internal helix. Let us check this on Subfamily 3 (Figure 5-20).


Figure  5-20.  Helical  wheel  showing  the  potential  for  an  internal  helix. 

Interestingly, the Q and D come together. D, when it is conserved, makes a salt bridge to the backbone. Considering some other segments from the respective subfamilies, we conclude that there is nothing other than two strands. The next serious anchor is NKVD. The anchor itself is an active site and is designated as a coil.




Table 5-16. Segment 14: positions 210-213 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           NKVD          158-161     f
2           NKTD          222-225     f
3           TKTD          210-213     f
4           NKTD          236-239     c
5           NKAD          204-207     g

We observed that this region was quite conserved. The conserved aspartic acid and lysine residues suggest that this region is part of an active site.

The next region within the alignment contains many gaps. However, the alternating hydrophobic and hydrophilic stretches within these segments suggest that a helix may be present throughout this region. We attempted to fit the residues in this region onto a helical wheel. The first subfamily was problematic, since surface residues dominated the wheel. This subfamily would have to contain a short helix, since the helix would be disrupted by proline residues. The amino acid residues found in this region of Subfamily 2 fit readily onto a helical wheel, as can be seen in Figure 5-21. Subfamily 3 yields a short amphipathic stretch if we ignore one P. There is a gap; shifting the gap makes a helix, as shown in Figure 5-21. It is the undisputed helix in Subfamily 2, however, that makes this case stronger. In Subfamily 4, the NP anchor is almost gone, as are many gaps. Removal of the most variable (highly gapped) sequences collapses the sequence to one that appears amphipathic on the helical wheel.

The following parse, which also represents an anchor, is an asparagine paired with another known structure breaker: glycine, serine, or proline. In Subfamily 5, this pair is absent in all sequences except one: sequence j, positions 308-309.




Figure  5-21.  Segment  15:  helical  wheels  based  on  positions  221-231  from  Subfamily  1. 


Table 5-17. Segment 16: positions 233-234 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           NP            179-180     APC
2           NG            245-246     f
3           NP            233-234     f
4           NP            257-258     k
5           NG            236-237     m

There is a strong case for a short, 9-position helix right after the NP parse/anchor for Subfamily 1, as shown in the helical wheel (Figure 5-22).




Figure  5-22.  Helical  wheel  for  positions  182-190  from  Subfamily  1. 




This region is followed by a long gapped region, but an amphipathic strand can be seen at 270-280 in Subfamily 1. What follows in Subfamily 1 is an obvious, long helix, positions 288-303. The extent of the conservation of residues makes a helix questionable within this subfamily. This stretch of Subfamily 2 supports an amphipathic helix on a wheel. The case for Subfamily 3 in this region is even stronger. Subfamily 4 sequences support a wheel that suggests the helix may be shorter, owing to interruptions caused by proline residues. The fifth subfamily also suggests that a helix exists in this region. These helices are shown in Figure 5-23.


Figure  5-23.  Segment  17:  helical  wheels  based  on  positions  347-361  from  Subfamily  1. 

The next serious anchors are a strong parse followed by a KG, as shown in Table 5-18. In between these anchors there is a weak signal of a β strand based on alternating surface and interior residues. Unfortunately, the alignments are not terribly reliable in this segment.




Table 5-18. Segment 18: positions 363-364 and 369-370 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           NP            309-310     APC
2           GP            397-398     l
3           SG            363-364     f
4           PG            407-408     g
5           PKG           322-324     a

1           KG            328-329     f
2           KG            412-413     f
3           KG            369-370     f
4           KG            413-414     k
5           KG            329-330     g

The conserved basic residues Arg and Lys in this region are separated by a hydrophobic residue. The charged nature of the conserved residues suggests that they are active site residues.

A short conserved stretch of basic amino acids follows the KG anchor. The presence of four or five consecutive hydrophobic amino acids suggests that an interior strand may occur in this region. It is curious to note that the conserved S lies at the contact between the surface and interior arcs of the wheel. Subfamily 2 could be largely buried; we also noted that the conserved H lies at the contact between the surface and interior arcs of the wheel. This is a case where multiple families, dubiously aligned, can resolve a secondary structure prediction problem that none of them could solve convincingly by itself. In Subfamily 3, the alignment is fragmented. We assumed that this region is in fact a helix based on the predictions for the other families. Although the helical wheel for the fourth subfamily is hopelessly fragmented, the fifth subfamily yields a helical wheel with at least passable amphiphilicity, though hydrophobic residues dominate the wheel and other positions are occupied by residues that are difficult to assign, such as glutamine and glycine.


The next serious anchor is the strand near the end. The presence of four highly conserved hydrophobic residues at these positions suggests that a β strand is located near the end of the sequence. The strand at the end is based on the possible amphipathic strand in Subfamily 1 and the stretch of hydrophobic residues followed by a conserved glycine in the remaining subfamilies, as shown in Table 5-19.



Figure  5-24.  Segment  19:  helical  wheels  based  on  positions  381-392  from  the  target 
sequence. 


Table 5-19. Segment 20: positions 415-420 from Subfamily 3

Subfamily   Amino acids   Positions   Representative sequence
1           GIEIQ         383-387     b
2           KLVFIG        451-456     a
3           TMVFIG        415-420     P
4           ELVLIG        531-536     k
5           RLVVIG        378-383     g



There is little consistency among the subfamilies in the sequence that follows. This stretch of Subfamily 1 is highly charged and highly gapped, which does not support a helix. Subfamily 2 is also highly charged, which would give a wheel dominated by surface residues. In this region, the third and fourth subfamilies support a passable helix, though they also possess a high percentage of charged residues. In this region of the alignment, the fifth subfamily does not support a helix. Several of the sequences within this subfamily end abruptly after the previously mentioned β strand, while others extend further with sequences that are highly hydrophobic in nature.


Figure  5-25.  Segment  21:  helical  wheels  based  on  positions  420-431  from  Subfamily  3. 


Results  for  T0149 

T0149 is a hypothetical protein encoded by the gene yjiA of Escherichia coli. This protein was selected, as part of a structural genomics project, for X-ray crystallographic structure determination and analysis to assist with the functional assignment for the CASP5 project. The amino acid sequence of this α+β protein has no homology to that of other proteins.




Figure 5-26. Ribbon representation of T0149. The figure on the right represents what we did right (green) and what we missed (red).


             10        20        30        40        50        60        70
              |         |         |         |         |         |         |
f  -                                           VPVTLLTGFLGAGKTTLLNELLAAPAMRDAAV
e  -                                           VPVTLLTGFLGAGKTTLLNELLAAPAMRDAAV
h  - MSLVASSHGRFLFVPCGSRSCLALCHAGGQPSPYRQGMSRDRIPVSIITGFLGAGKSTLLNRLLKDPEMSDAAI
g  - MSLVASSHGRFLFVPCGSRSCLALCHAGGQPSPYRQGMSRDRIPVSIITGFLGAGKSTLLNRLLKDPEMSDAAI
i  -                                           VPVSILTGFLGAGKTTLLNRVLKDPALADTAV
j  -                                         DPIPVSVLTGFLGSGKTTLLNRLLKDPALSDTAV
k  -                                          PIPVSVLTGFLGAGKTTLLNRLLKDPVLADTAV
d  -                                           IPITLITGFLGSGKTSFLSEYLNQIDHQGVAL
c  -                                           IPITLITGFLGSGKTSFLSEYLNQIDHQGVAL
b  -                                           IPITLITGFLGSGKTSFLSEYLNQTDHQGVAL
a  -                                           IPITLITGFLGSGKTSFLSEYLNQTDHQGVAL
l  -                                           IPVTILTGFLGSGKTTLLKRILTEQHGMKIAV
m  -                                           IPVTVLTGFLGAGKTTLLKYLLQAEHGMKIAV
o  -                                          PIAVTLLTGFLGAGKTTLLRHILNEQHGFKIAV
s  -
r  -
p* -                                        MNPIAVTLLTGFLGAGKTTLLRHILNEQHGYKIAV
q  -
n  -                                          PIAVTLLTGFLGAGKTTLLRHILNEQHGFKIAV


iaiiiaaa . aiiii . a . aaaaiiiaai . . a . a 

DSSP 


. iaa. iaaasi .iiiia.ii.i.a.iiissiissSsi.siii 
SililllllllssIilssIIsillisSisiilll 


TARG 


EXPT 

PRED 


PIAVTLLTGFLGAGKTTLLRHILNEQHGYKIAV 

A B C 


T ■ 

EEEEEEEE  HHHHHHHH  EEE 

EEEEEEEEEE  HHHHHHHHH  EEEEE 


Figure 5-27. The alignment for Target T0149. The sequence for which the crystal structure is unknown is p, and is indicated by a *.






80  90  100  110  120  130  140

f  - IVNEFGSVPIDH_GLVRTGSERYFHTTTGCICCTASSDIRTSL_YELHQVSLEGVMPEISRWIETTGLADPA
e  - IVNEFGSVPIDH_GLVRTGSERYFHTTTGCICCTASSDIRTSL_YELHQVSLEGVMPEISRWIETTGLADPA
h  - IINEFGDVSIDH_MLVESAGEGIIELAEGCLCCTVRGELVDTL_AELIDGIQTGKLKPVKRWIETTGLADPA
g  - IINEFGDVSIDH_MLVESAGEGIIELAEGCLCCTVRGELVDTL_AELIDGIQTGKLKPVKRWIETTGLADPA
i  - IINEFGDVSIDH_LLVEASSDGVIELSDGCLCCTVRGELVDTL_ADLMDRMQTGRIKPLKRWIETTGLADPA
j  - IINEFGEVSIDH LLVEQASEGVIELADGCLCCTVRGELVDTL_ADLIDRLQTGRIKALKRVIIETTGLADPA
k  - IINEYGEVAIDH LLVEQASDGIIQLSDGCLCCTVRGELVDTL_ADLVDRLQTGRIARLARVIVETTGLADPA
d  - IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWRGEI LKRVIIETTGLANPA
c  - IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWRGEI LKRVIIETTGLANPA
b  - IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWHGEI LRRIIIETTGLANPA
a  - IINEIGQAALDQRILSVQYCGEKMLYLNAGCVCCNKRLDLVESLKATLNNYEWHGEI LRRIIIETTGLANPA
l  - IENEFGEENIDNDIL_VQDGNEQIVQMSNGCICCTIRGDLVAAL_SDLLTRRDKGEL_QFDRWIETTGVANPG
m  - IENEYSETPID_GQL_LGVEPVQVMTLSNGCVCCSINTDLEKAL_FLLLERRDNGEI_DFDRLVIECTGLADPA
o  - IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCTRSNELEDAL_LDLLDSRDRGDI_AFDRLVIECTGMADPG
s  - IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG
r  - IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG
p* - IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG
q  - IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCSRSNELEDAL_LDLLDNLDKGNI_QFDRLVIECTGMADPG
n  - IENEFGEVSVD_DQL_IGDRATQIKTLTNGCICCTRSNELEDAL_LDLLDSRDRGDI_AFDRLVIECTGMADPG

*•** * **.** . *...*.**.*.*.

IiAAI . si . IAisiiilss . isilisl . s . AIAAS . s . sIssilaislisSSiS . SIsSISAIIIAIA. IIs . i
iliSiiSIsis  sii  slSisIillilililimisisIiill  Illlilisssil  sliillllllisiill


D 

E 

F 

G 

H I 

iir> 

1 

] 

EE 

EE  E E 

E EEE 

EEEE 

HHHHH  HHHHHHHH 

EEEEEEE  HH 

EE 

E EE  EE 

EEEE 

HHHHHHHH  HHHHHHH 

EEEE  EEEE 

150 

160 

170 

180 

190  200 

210  220 

* i I I | | . . 

PIINSLIAGGTPALGLRDHIVARHFHLSGWTVFDVI SGRAMLDSFIEGWKQLAFADHIVLTKTDLADA  A 

PIINSLIAGGTPALGLRDHIVARHFHLSGWTVFDVISGRAMLDSFIEGWKQLAFADHIVLTKTDLADA A 

PVMQSVM — GNPVI AQNFVLNGMITWDAVNGLSTLDNHEEAVKQAAVADRLVMTKRTLADASA_IA 

PVMQSVM — GNPVI AQNFVLNGMITWDAVNGLSTLDNHEEAVKQAAVADRLVMTKRTLADASA_IA 

PVLQSVL — GNPVI AQNFRLDGWTWDAVNGEQTIANHVEAMKQVAVADRLVISKSGLAKDDG_VG 

PVLHS IM — GHPLL TQVFRLDGVLATVDAVNGMATLDNHEEAVKQAAMADRI ILTKTDLPEAQAGLP 

PVLQSIM — AHPAL VQAFRLDGVIALVDAVNGNATLDAHVEAVKQVAVADRIVLSKVDLVTDTGDLD 

PILWTN LSDVFLGAHFEIQSWACVDVLNAKTHLTNN_EAKEQIVFADSVLLTKTDLQNDSTALM 

PILWTN LSDVFLGAHFEIQSWACVDVLNAKTHLTNN_EAKEQIVFADSVLLTKTDLQNDSTALM 

PILWTI LSDTFLGVHFEIQSWACVDALNAREHLTNN_EAKEQIVFADSVLLTKTDLQNDSAALT

PILWTI LSDTFLGVHFEIQSWACVDALNAREHLTNN_EAKEQIVFADSVLLTKTDLQNDSAALT 


_DDEIAARYLLDAVITLVDAKHGNQQLDRQEEAQRQVGFADAIFITKGDLVSD_DDVE 
_DEELCQRYVLDGIITLVDAANAERHLQE_TIAQAQVGFADRILVSKTDLVDA_ATFE 
_HDVLCERYLLDGVIALVDAVHANEQMNQFTIAQSQIGYADRILLTKTDV__A_GDSE 
_HEILCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGYADRILLTKTDV  A GRAF 
_HEILCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGYADRILLTKTDV  A GEAR 
_HEVLCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGYADRILLTKTDV  A GEAF. 
_HEVLCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGYADRILLTKTDV  A GEAE 
_HDVLCERYLLDGVIALVDAVHANEQMNQFTIAQSQIGYADRILLTKTDV__A_GDSE 

* *__  ** * 

.Il.iSii. . s. ii. is .S.iiSslils. IIIIIAIisiSSIIss . siiisAIillAsIIIsAssI . sssSSiS 
illsIIiS  iSiliSilsIIIIIIlllliiiisIIsSisilllllllllllliiiiSs  I Ssls 

PIIQTFFS  HEVLCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGYADRILLTKTDV  A GEAE 

J K L M N 0 


PVAQTFFM 

PVAQTFFA. 

PIIQTFFY. 

PIIQTFFS. 

PIIQTFFS. 

PIIQTFFS. 

PIIQTFFS. 

PIIQTFFS. 


PW 

HHHHHHHH 

EEEEEE 


HHHHHHEEEEEEEEEEE 
EEE  EEEE  EEEEE 


■> 

HHHHHHH  HHHHHHHHH  EEEEE 
EEEE  EEEE 


HH 

HH 


Figure  5-27.  (Continued) 






230 


240 


250 


260 


270 


280 


290 


f 

e 

h 

g 

i 

j 

k 

d 

c 

b 

a 

1 

m 

o 

s 

r 

P 

q 

n 


_GHD_ 

GHD 


- AIDQKSLTDLNPAAHLHDRNAADFDLMSLFSTKDYSLHGKTEDVPGWLAADRLSL  HA

- AIDQKSLTDLNPAAHLHDRNAADFDLMSLFSTKDYSLHGKTEDVPGWLAADRLSL  HA

- ALTAR_LEALNPRATIEDGDKADWSAAGLLDNGLYDVSSKNADVGRWLGEES  GHGH

- ALTAR_LEALNPRATIEDGDKADWSAAGLLDNGLYDVSSKNADVGRWLGEES GHGH 

- RLEAR_LRDLNPRAP I IDGDTEEAGRADLFACGLYDATTKVADVGRWLQDEAHAD GHGHV 

ALKAR_LRTLNPGADILEAGEERTGYAALFECGLYNPQTKSADVRRWLKAEAYED EHHGHVCGPDCGHDHH 

- TLRAR_LRQINPGABLLDAGHLTTGVAALFDCGLYNPATKSADVRRWLGEEAAHD  HAHGH HHDGH 

- KLKER_IQSLNPSAEIFDKKNIDYE SFFSRKNGARNFMLRMPKDSHSQ 

- KLKER_I Q SLNPSAE I FDKKNI DYE SFFSRKNGARNFMLRMPKDSHSQ 

- KLKER_IQALNP SAE I FDKRAIDYE SLFSRKNRARNFMPRMPKDSHSQ ~ 

- KLKER_IQALNPSAEIFDKRAIDYE SLFSRKNRARNFMPRMPKDSHSQ 

- ALRHRLLH_MNPRAP I RTANFGEAP IDT I FDLRGFNLNAKLE I DPDFLREDDHDHDHGHEHNHAC S PDCDHD 

- ALGQRLQR_INRRALVHWEHGRIDLAHLLDVRGFNLNADL 

- KLRERLAR_INARAPVYTWHGDIDLSQLFNTSGFMLEENV 

- KLRERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 

- KLRERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV  “ 

- KLHERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 

- KLHERLAR_INARAPVYTVTHGDIDLGLLFNTNGFMLEENV  ~ 

- KLRERLAR_INARAPVYTWHGDIDLSQLFNTSGFMLEENV 


sIS . silssIA. slsIssiSSssIsiiSiisiS. IsissSi .si. .iisss. .s . aa. ssasa. a . . aasaasa 
sliillll  IlllliiiilsSSsIsIiillsIillsiiisI 
KLHERLAR  INARAPVYTVTHGDIDLGLLFNTNGFMLEENV 


y 

HHHHHHHH  H 
HHHHHHHH  H 

300 


EEE 


HHHH 

EE 


310 


320 


330 


340 


350 


360 


370 


f - 


a 

1 

m 

o 

s 

r 

P 

q 

n 


_HDHDHDHDHAHG 

_HDHDHDHDHAHG 

_HDHHHEDGHGHGH_ 
HHDHHHDHHNDQGY_ 
DHGHDHDHSHDHG 


HDINRHGAGVEAFDLTFDGALDPQAIVTFLELVTSNVHSGLLRLKG 

_HDINRHGAGVEAFDLTFDGALDPQAIVTFLELVTSNVHSGLLRLKG 


_HEHGHDHHHGHG. 


_HHHHHDVNRHDASIRSFSIIHDQPIDPMAIDMFVDLLRSAHGEKLLRMKA 
_HHHHHDVNRHDAS IRS F S I IHDQ P I D PMA I DMFVDLLRSAHGEKLLRMKA 
_HHHHHDVNRHGSDIRSFSIVHDRPIEPMALEMFIDLLRSAHGEKLLRMKA 

_AHHHH HDDAIRSFSLRHDAPIPVSTFEMFLDLLRSTHGEKLLRMKG 

HR HDSRVRSY SLVHDGPVPF  SAIEMFLDLLRSTHGEKLLRMKG 

GFETLSVSFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

GFETLSVSFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

— GFETLSINFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 

GFETLSINFEGTMEWSAFGIWLSLLLHQYGTQILRIKG 


HHHHHGHAHHTDRIASFVFRSERPFNYTKLEEFLSGVLNVYGEKLLRYKG 

GPGIGLRPLRAVAAK. DSRDRIGTLVLRSDTPLDLERLSEFMDDLLQWHGNSLLRYKG 

LASQPRFHF I ADKQNDVS  S I WELDYPVD I S EVSRVMENLLLESADKLLRYKG 

.VSTKPRFHF I ADKQND I S S I WELDYPVD I SEVSRVMENLLLE SADKLLRYKG 

VSTKPRFHF I ADKQND I S S I WELD YPVD I S EV SRVMENLLL E SADKLLRYKG 

V STKPRFHF I ADKQNDI S S I WELDYPVDI S EVSRVMENLLLESADKLLRYKG 

.VSTKPRFHF  IADKQNDI SS IWELDYFVDI SEVSRVMENLLLESADKLLRYKG 

LASQPRFHFIADKQNDVSSIWELDYPVDISEVSRVMENLLLESADKLLRYKG 


sasasassssss . s . .a. .a.i.SS. . isiiisisssls . IilSissilsisilsillsili . siissIIAIAi 

IiissSsisSiissIsIillllilSIIIiissIssIIsillssIIsilHIII 

VSTKPRFHFIADKQNDISSIWELDYPVDISEVSRVMENLLLESADKLLRYKG 

R S T u 


HHH 


Figure  5-27.  (Continued) 


EEEEEEEE 

EEEEE 


E HHHHHHHHHHHHHHH  EEEEEE 
HHHHHHHHHHHHHHH  EEEEE 


124 


380  390  400  410  420  430  440 


„ I I I I I I 

f - I FGATDDPERPWAHAV QHRLY PL I RL E SWPD_GDRS TRIVMI GMDVPQQ P I RDLFNALAAQAG 

e - IFGATDDPERPWAHAVQHRLYPLIRL ESWPD_GDRSTRIVMIGMDVPQQPIRDLFNALAAQAG 

h - IVKLSDNPGRPLVLHGVQNIFHTPERL AAWPDPTDQRTRMVLITKDLPEAFVKDLFAAFTGTPGIDRP 

g - IVKLSDNPGRPLVLHGVQNIFHTPERL AAWPDPTDQRTRMVLITKDLPEAFVKDLFAAFTGTPGIDRP 

i - IVSVADNPERPWLHGVQTVFHAPERL AAWPDPADRRTRMVLITKGLDEAFVRDLFDAFTGKPRVDRP 

j - IVQIAEDPDRPWIHGVQKIFHPPARL PQWPQ_GKRETLLVLIVKDLPEAYVRELFDAFLGRPGLDRP 

k - VIELSEDPSRPLVIHGVQKILHPPARL PAWPD_GQRGTRLVLITLDMPEDYVRRLFAAFTNRPSIDTP 

d - IIDIGSD LLVSINGVMHVIYPP 

c - IIDIGSD LLVSINGVMHVIYPP 

b - IIDIGSG FLVSINGVMHVIYPP 

a - IIDIGSG FLVSINGVMHVIYPP 

1 - VLYM_EGVDRKWFQGVH 
m - VLNI_ADEPRRLVFQGVL 
o - MLWI_DGEPNRLLFQGVQ 
S - MLWI_DGEPNRLLFQGVQ 
r - MLWI_DGEPNRLLFQGVQ 
p - MLWI_DGEPNRLLFQGVQ 
q - MLWI_DGEPNRLLFQGVQ, 
n - MLWI_DGEPNRLLFQGVQ, 


,QLMGSDVGGKWDG_ETPGNRMVFIGVDLPRDTI 

.RLYGFDWDSEWRDDEARESVIVFIGDNLPEDSIREGF 

RLYSADWDRPW_GDETPHSTLVFIGIQLPEDEIRAAF 

RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

RLYSADWDRPW_GDEKPHSTMVFIGIQLPEEEIRAAF 

RLYSADWDRPW_GDETPHSTLVFIGIQLPEDEIRAAF 


* 

IIsI . sSsss . Iiliills . iisissii . iaisssissssSssisiiiisssissS . issiisii . ssisias . 


Ilil  sSisillllllli 


sissssiiisl  SSiSiilllllllisIIiSslisiliSIss 


MLWI  DGEPNRLLFQGVQ 

V 


EECE  EEEEEEEE 

EEEE  HHHHHHH 


RLYSADWDRPW  GDEKPHSTMVFIGIQLPEEEIRAAFAGLRK 

W X Y 


EEEEEEEEE  EEEEEEEE  HHHHHHHHH 

hhhhh  eeeeee  HHHHHHHHHHHH 


Figure  5-27.  (Continued) 


Target T0149 has 25 secondary structural elements. In our prediction, we predicted
15 elements correctly (A, B, C, D, E, G, H, K, N, O, S, T, U, X, and Y), predicted
6 elements incorrectly (I, J, Q, M, V, and W), and missed 4 assignments (F, L, P, and R).
The mispredicted element Q and the missed element R both appear to reflect erroneous
assignment of secondary structure by the contest auto-judge. There are no α helices that
cover just four residues. Inspection of the multiple sequence alignment showed that these
elements were aligned to gapped regions and, therefore, were non-core. Helix J is also
an example of a secondary element that was aligned to a gapped region. Helix J is 6
residues (167-172) in length, while the multiple sequence alignment shows gapping in the
first 3 positions. These are examples of the first type of error, associated with
difficulties in the methods themselves.
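The tally above can be restated as a short calculation. The following Python sketch
only re-derives the segment-level fractions from the three lists given in the text; the
function and variable names are illustrative and are not part of the original analysis.

```python
# Segment labels for target T0149, copied from the tally in the text.
correct = list("ABCDEGHKNOSTUXY")
incorrect = list("IJQMVW")
missed = list("FLPR")


def segment_tally(correct, incorrect, missed):
    """Return the total number of segments and the fraction in each class."""
    total = len(correct) + len(incorrect) + len(missed)
    return {
        "total": total,
        "correct": len(correct) / total,
        "incorrect": len(incorrect) / total,
        "missed": len(missed) / total,
    }


tally = segment_tally(correct, incorrect, missed)
print(tally)
```

A segment-level tally of this kind counts each element once, regardless of its length,
which is why it can differ from a per-residue score.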

The first incorrect assignment of T0149 is helix I. The experimental structure
assignment identifies a helix on segment 147-156 of the subfamily that includes the
target sequence, Subfamily 3. The first three positions of the helix are proline (P),
glycine (G), and proline (P). This is very surprising. In fact, PG and GP are among the
top twelve dipeptides that break secondary structures (Table 4-2).
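A check of this kind is easy to automate. The sketch below, in Python, scans a segment
for breaker dipeptides. It is an illustration under stated assumptions: only PG and GP
are named in the text, so the set is deliberately partial, and the function name and the
example segment are hypothetical.

```python
# Only PG and GP are named in the text; the remaining ten dipeptides of
# Table 4-2 are not reproduced here, so this set is deliberately partial.
BREAKER_DIPEPTIDES = {"PG", "GP"}


def breaker_positions(segment, breakers=BREAKER_DIPEPTIDES):
    """Return 0-based start positions of breaker dipeptides in a segment."""
    segment = segment.upper()
    return [i for i in range(len(segment) - 1) if segment[i:i + 2] in breakers]


# Hypothetical 10-residue segment that begins P-G-P, as helix I does;
# the remaining residues are placeholders, not the real T0149 sequence.
example = "PGPAAAAAAA"
hits = breaker_positions(example)
print(hits)
```

Two overlapping hits at the first two positions correspond to the P-G-P start described
above, which is why a helix assignment there is unexpected.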

Incorrectly assigning interior helices arises from the naivety of the simple heuristic.
Internal helices are not expected to reveal themselves from patterns in their surface and
interior assignments. It is quite clear that internal helix M was missed because of this
weakness in the heuristic. A helix was identified in the experimental structure,
but a short strand was predicted.

A helix was predicted where the experimental structure identifies two strands, V
and W. This segment is highly gapped. Returning to the notes made during the prediction,
it was evident that a helix was predicted based on the evidence of the other subfamilies.

Conclusion 

This  dissertation  confirms  that  evolution-based  prediction  tools  can  produce 
excellent  secondary  structural  models  when  an  adequate  number  of  sequences  having 
adequate  evolutionary  divergence  are  used  as  input.  Where  evolution-based  methods  did 
poorly,  the  poor  performance  could  in  general  be  traced  to  few  homologous  sequences  for 
the  target  or  inadequate  sequence  divergence  among  the  homologs  within  the  family. 

One prescription for improvement is clear from this observation: more sequences
need to be collected. This will be the inevitable outcome of genome projects. As the
sequence databases grow, fewer protein families will be small (in their representation in
the database), and the quality of evolution-based predictions should improve accordingly.

Nevertheless,  evolution-based  methods  continued  to  have  difficulties  assigning 
secondary  structure  near  active  sites  and  distinguishing  between  internal  strands  and 
internal  helices.  Therefore,  one  still  cannot  be  certain  that  secondary  structure  models 
produced  by  evolution-based  methods  are  free  of  all  serious  mistakes,  even  when 
adequate  diversity  is  contained  within  the  protein  family  being  examined.  Thus,  any 
model needs to be inspected in detail, and fully transparent predictions that call attention to
possible  serious  mistakes  remain  an  important  part  of  a prediction.  The  emergence  of 
fully  automated,  transparent  prediction  tools  (such  as  Darwin)  should  combine  the 
informative  nature  of  a transparent  prediction  with  the  convenience  of  an  automated 
prediction. 

With respect to methods for scoring evolution-based predictions, CASP5 also
confirmed conclusions that were established earlier. Q3 scores are not appropriately used
in evaluating predictions, even as a cutoff to distinguish predictions worthy of closer
inspection from those that are not. If the experimental structure is for a protein with large
segments introduced in addition to the core segments, the Q3 scores can be arbitrarily
low.
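The effect of inserted segments on a per-residue score can be seen in a small example.
The Python sketch below assumes the usual definition of Q3 as the fraction of residues
whose three-state label (helix H, strand E, coil C) matches the experimental assignment.
The strings are invented for illustration and do not come from any CASP target.

```python
def q3(predicted, observed):
    """Fraction of residues whose three-state label (H, E, C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    if not observed:
        raise ValueError("strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Invented 20-residue core in which every residue is predicted correctly.
core_obs = "CHHHHHHCCEEEEECCHHHH"
core_pred = "CHHHHHHCCEEEEECCHHHH"

# The same core followed by an invented 20-residue inserted helix that a
# consensus prediction, built on the core alone, labels as coil.
full_obs = core_obs + "H" * 20
full_pred = core_pred + "C" * 20

core_score = q3(core_pred, core_obs)
full_score = q3(full_pred, full_obs)
print(core_score, full_score)
```

In this invented case the core is predicted perfectly, yet the score over the full chain
is halved; a longer insertion would lower it further.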

Future  Work 

One cannot help but be impressed by the progress that the summary above
represents. In the 1980s, the only method to predict a folded structure of a protein was to
identify it as a homolog of a protein with a known structure, or to be assisted by
experimental information (most notably circular dichroism spectra) that indicated that a
protein adopted a regular class of fold (generally all helical). Today, tools are available
that have permitted the construction of models of secondary structure that are usable for
other purposes.

It  would  be  a mistake  to  dismiss  this  progress  as  an  inevitable  outcome  of  having 
more  sequence  data.  Evolution-based  predictions  do,  of  course,  incorporate  more 
information  than  a classical  prediction.  Additional  information  certainly  cannot  hurt 
prediction, if only by allowing “noise” to be averaged out. To the extent that
mistakes in classical predictions arise from “noise,” averaging the predictions over
several homologs should diminish mistakes.
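One simple way to average predictions over homologs is a per-position majority vote.
The Python sketch below is an assumption about what "averaging" could look like, not a
description of any particular published method; the function name and the five
prediction strings are invented, and the strings are taken to be already aligned.

```python
from collections import Counter


def consensus(predictions):
    """Per-position majority vote over equal-length prediction strings."""
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("predictions must be aligned to equal length")
    result = []
    for column in zip(*predictions):
        # most_common(1) returns the label with the highest count; ties are
        # resolved in favor of the label seen first in the column.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# Five invented single-sequence predictions, each with one isolated error.
homolog_predictions = [
    "HHHHCCEEEE",
    "HHHCCCEEEE",
    "HHHHCEEEEC",
    "HCHHCCEEEE",
    "HHHHCCECEE",
]
voted = consensus(homolog_predictions)
print(voted)
```

Because each invented error occurs in only one of the five strings, the vote removes all
of them. Errors shared by most homologs would survive the vote, which is the limitation
of noise averaging.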

First, evolutionary considerations of natural selection, protein stability, and
conformation showed the nature of the problem. As the products of natural selection,
natural  proteins  have  evolved  to  violate  folding  rules  to  engineer  a desired  level  of 
instability.  As  organic  molecules,  proteins  should  have  local  conformations  that  are 
influenced  by  long-range  interactions.  These  observations  suggested  that  classical 
prediction  methods  based  on  single  sequences  would  not  work,  indeed  could  not  work, 
for  the  general  protein.  These  suggestions,  in  turn,  guided  work  toward  areas  that 
ultimately  proved  to  be  more  productive,  work  that  focused  on  identifying  elements  of 
tertiary  structure  (in  particular,  surface  accessibility),  constrained  ways  for  using  patterns 
of  variation  and  conservation  as  indicators  of  tertiary  structure,  and  exploited  manual 
analysis  of  homologous  protein  sequences  to  speed  the  development  of  insight  that,  in 
turn,  speeded  the  development  of  improved  prediction  heuristics. 

The balance between evolutionary theory, chemical principles, and massive
amounts of sequence data may well be useful in analyzing problems generally in
biological chemistry, including the role of biological macromolecules in differentiation
and development, the design of biological pathways, and the biological chemistry of
disease. If so, then this balance in the protein structure prediction field may serve as a
model for a significant part of the future development of biological chemistry.

Much  remains  to  be  done,  however.  Approaches  that  model  the  conformation  of  a 
target  protein  from  the  known  conformation  of  a homologous  protein  are  quite 
successful,  but  only  to  the  extent  that  the  target  and  reference  structures  are  the  same.  To 
the  extent  that  the  target  and  reference  proteins  do  not  have  the  same  conformation, 
homology  modeling  confronts  directly  the  most  difficult  problems  in  contemporary 
physical  chemistry:  How  to  model  quantitatively  the  interaction  of  molecules  and 
molecular  fragments  with  each  other,  especially  in  solution,  especially  when  the  solvent 
is  water.  Much  more  work  must  be  directed  toward  understanding  the  underlying 
physical  chemical  issues  involved  in  this  interaction,  both  in  proteins  and  in  small 
molecules. 

Even  if  ab  initio  tools  based  on  evolutionary  information  work  at  the  level  of  the 
secondary  structure,  they  do  not  represent  a comprehensive  solution  to  the  structure 
prediction  problem.  At  best,  an  ab  initio  secondary  structure  prediction  will  identify  a 
homolog  of  the  target  protein  in  the  crystallographic  database.  This  converts  the  ab  initio 
problem  into  a homology  modeling  problem,  and  the  problems  associated  with  homology 
modeling  must  then  be  solved. 

Even if ab initio tertiary structure modeling from predicted secondary structural
elements becomes routine, however, the problem is not solved. Given a consensus model
for tertiary structure, most users want to proceed to a model for the conformation of a
specific protein in the family. This is, of course, another problem in homology modeling,
with the specific protein being the target structure and the consensus model being the
homolog. It therefore also confronts the central problems in physical chemistry
mentioned above.

Thus,  virtually  all  lines  of  progress  in  ab  initio  prediction  merely  reduce  the 
problem  to  one  of  homology  modeling,  which  must  then  confront  and  resolve  problems 
in  physical  chemistry  that  are  difficult  to  resolve.  The  message  is  clear:  sooner  or  later 
the  physical  chemical  problems  alluded  to  above  will  need  to  be  solved. 

Further,  a realist  must  point  out  that  structure  prediction  has  a competitor: 
experimental  structure  determination.  During  the  time  that  modeling  has  made  the 
advances  outlined  in  this  review,  crystallography,  electron  microscopy,  and  NMR 
analyses  of  protein  structure  have  also  made  dramatic  progress.  Assisted  by  molecular 
biological  tools  yielding  proteins  in  large  amounts,  a rationalization  of  conditions  for 
crystallizing  proteins,  new  methods  for  phasing  diffraction  data,  and  computational 
advances  that  speed  the  solution  of  the  structure,  the  number  of  crystal  structures  per  year 
is  about  10-fold  higher  today  than  it  was  a decade  ago.  To  this  is  added  increasing 
numbers  of  structures  determined  by  NMR  methods. 

The general problem of structural biology is not unbounded. The number of
families of proteins readily recognizable by sequence similarities will be less than 10,000
when the genomes of all organisms on the planet are sequenced (Gonnet et al. 1992).
The number of distinct folds may be less than 1000 (Chothia 1992). At some point,
experimental analysis of protein structure becomes similar to the analysis of other types
of chemical structure. A good analogy is the work done between 1850 and 1950 to
identify all of the elements in the Periodic Table. After 1950, the elements were all
known, and the research problem became obsolete.


CHAPTER  6 
CONCLUSION 

This  dissertation  reported  a fundamentally  new  approach  for  extracting 
conformational  information  from  an  alignment  of  homologous  proteins.  A reliable 
algorithm  is  developed  for  identifying  surface  and  interior  residues  in  a protein.  An 
algorithm  is  also  developed  for  identifying  parsing  residues.  Finally,  we  have  used 
algorithms  to  make  a prediction  regarding  the  secondary  structural  segments  of  proteins. 

Clearly,  much  additional  information  remains  to  be  extracted  from  these 
alignments.  Approaches  that  model  the  conformation  of  a target  protein  from  the  known 
conformation  of  a homologous  protein  are  quite  successful,  but  only  to  the  extent  that  the 
target  and  reference  structures  are  the  same.  To  the  extent  that  the  target  and  reference 
proteins  do  not  have  the  same  conformation,  homology  modeling  confronts  directly  the 
most  difficult  problems  in  contemporary  physical  chemistry:  How  to  model  quantitatively 
the  interaction  of  molecules  and  molecular  fragments  with  each  other,  especially  in 
solution,  especially  when  the  solvent  is  water.  Much  more  work  must  be  directed  toward 
understanding  the  underlying  physical  chemical  issues  involved  in  this  interaction,  both 
in  proteins  and  in  small  molecules. 




APPENDIX  A 

SUPPLEMENTARY  MATERIAL  FOR  THE  CREATION  OF  THE  DATABASE 

#!/usr/bin/perl -w
use strict;
use DBI;

#######################################################################
# mySQL password is the argument: perl file.pl password > output.out
#######################################################################
my $Dbh = DBI->connect( "DBI:mysql:MC:sql", "dekee", $ARGV[0] )
    or die $DBI::errstr;

#######################################################################
# Get ALL DISTINCT SeqIds
#######################################################################
my %Seq = ();
my $Sth = $Dbh->prepare( "SELECT DISTINCT SeqId FROM SeqAnnotation
    WHERE Type BETWEEN 30000 AND 30006" ) or die $DBI::errstr;
$Sth->execute or die $DBI::errstr;
# Flag all seqs as not a duplicate
while ( my ( $SeqId ) = $Sth->fetchrow_array ) { $Seq{$SeqId} = 0; }
my $n = 0;
foreach my $SeqId ( keys %Seq ) { $n++; }
print "Seqs with structs = $n\n";

#######################################################################
# Make sure the sequences are in cat id 3 (non-redundant protein seqs)
#######################################################################
$Sth = $Dbh->prepare(
    "SELECT FamId FROM Module WHERE SeqId = ? AND CatId = 3" )
    or die $DBI::errstr;
foreach my $SeqId ( keys %Seq ) {
    $Sth->execute( $SeqId ) or die $DBI::errstr;
    delete $Seq{$SeqId} if $Sth->rows < 1;   #A LINUX ONLY STATEMENT??
}
$n = 0;
foreach my $SeqId ( keys %Seq ) { $n++; }
print "Seqs with structs in cat id #3 = $n\n";

#######################################################################
# Get all families for the seqs
#######################################################################
my %Fam = ();
$Sth = $Dbh->prepare(
    "SELECT DISTINCT FamId FROM Module WHERE SeqId = ? AND CatId = 3"
) or die $DBI::errstr;
$n = 0;
foreach my $SeqId ( sort keys %Seq ) {   #Loop thru every seq w/ struct
    $Sth->execute( $SeqId ) or die $DBI::errstr;
    while ( my ( $FamId ) = $Sth->fetchrow_array ) {   #Loop thru fams 4 seq
        push @{ $Fam{$FamId} }, $SeqId;   #Put seq into fam
        $n++;
    }
}
print "Fam x Seq = $n\n";
$n = 0;
foreach my $FamId ( keys %Fam ) { $n++; }
print "Fams covered by seqs = $n\n";

#######################################################################
# Mark Sequences if there is more than 1 crystal structure
#######################################################################
foreach my $FamId ( keys %Fam ) {
    next unless $#{ $Fam{$FamId} };   #Skip fams holding a single seq
    foreach my $SeqId ( @{ $Fam{$FamId} } ) {
        $Seq{$SeqId}++;   #Flag seq as a duplicate
    }
}

#######################################################################
# Count
#######################################################################
my %count = ();
foreach my $SeqId ( keys %Seq ) { $count{ $Seq{$SeqId} }++; }
while ( my ( $key, $val ) = each %count ) {
    print "Key=$key\tHits=$val\n";
}

#######################################################################
# Test all sequences
#######################################################################
print "Checking the results of everything in key=0\n";
$Sth = $Dbh->prepare( "SELECT DISTINCT FamId FROM Module WHERE SeqId = ?
    AND CatId = 3" ) or die $DBI::errstr;
my $Sth2 = $Dbh->prepare( "SELECT DISTINCT SeqId FROM Module WHERE FamId
    = ? AND CatId = 3" ) or die $DBI::errstr;
$n = 0;
foreach my $SeqId ( keys %Seq ) {
    next if $Seq{$SeqId} > 0;
    $n++;
    $Sth->execute( $SeqId ) or die $DBI::errstr;
    while ( my ( $FamId ) = $Sth->fetchrow_array ) {
        $Sth2->execute( $FamId ) or die $DBI::errstr;
        while ( my ( $DupId ) = $Sth2->fetchrow_array ) {
            next unless defined $Seq{$DupId};
            print "Error $SeqId $DupId\n" if $SeqId != $DupId;
        }
    }
}
print "Checked $n sequences\n";

#######################################################################
# Output all sequences with one & only one crystal across its families
#######################################################################
print "Sequences follow\n";
foreach my $SeqId ( keys %Seq ) {
    print "$SeqId\n" unless $Seq{$SeqId};
}

Figure A-1. GetSeqIds.pl
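The selection logic of GetSeqIds.pl can be sketched without a database. The following
Python fragment is an illustration only: it replaces the MySQL queries with an in-memory
dictionary, and the function name and the example identifiers are hypothetical.

```python
from collections import defaultdict


def unique_structure_seqs(seq_families):
    """Return sequence ids that share no family with another listed sequence.

    seq_families maps each structure-bearing sequence id to the family ids
    it belongs to (a stand-in for the Module table queries in the script).
    Sequences with no family are dropped, as the script deletes them.
    """
    fam_members = defaultdict(list)
    for seq_id, fams in seq_families.items():
        for fam_id in fams:
            fam_members[fam_id].append(seq_id)

    duplicate = {s: 0 for s, fams in seq_families.items() if fams}
    for members in fam_members.values():
        if len(members) < 2:
            continue  # family has a single structure; nothing to flag
        for seq_id in members:
            duplicate[seq_id] += 1

    return sorted(s for s, count in duplicate.items() if count == 0)


# Hypothetical ids: sequences 102 and 103 share family 3, so both are flagged.
example = {101: [1], 102: [2, 3], 103: [3], 104: [4]}
selected = unique_structure_seqs(example)
print(selected)
```

As in the Perl script, a sequence is kept only if every one of its families contains no
other sequence with a structure.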




GI  identifiers  are  assigned  by  NCBI  for  all  sequences  contained  within  NCBI’s 
sequence  database.  The  gi  identifier  provides  a uniform  and  stable  naming  convention 
whereby  a specific  sequence  is  assigned  its  unique  gi  identifier.  If  a protein  sequence 
changes,  however,  a new  gi  identifier  is  assigned,  even  if  the  accession  number  of  the 
record  remains  unchanged.  Thus  gi  identifiers  provide  a mechanism  for  identifying  the 
exact  sequence  that  was  used  or  retrieved  in  a given  search.  A list  of  gi  identifiers 
representing all of the protein sequences used in this study is shown in Table A-1.


Table A-1. List of gi identifiers used in this study


206 

268 

270 

272 

382 

2056 

3797 

4437 

9555 

10439 

15268 

15518 

15584 

21311 

29626 

29806 

30003 

31676 

31869 

33801 

35865 

35869 

35907 

37058 

37070 

37433 

37547 

37615 

39312 

39785 

40218 

40944 

40992 

40998 

41025 

41097 

41106 

41116 

41134 

41188 

41195 

41423 

41441 

41483 

41484 

41505 

41583 

41611 

41646 

41698 

41718 

41736 

42000 

42011 

42012 

42034 

42046 

42048 

42161 

42175 

42370 

42607 

42740 

42839 

43066 

43107 

43234 

43268 

43601 

43605 

43608 

43615 

45756 

47642 

48184 

48240 

48666 

48866 

48867 

48869 

48907 

49306 

56297 

56756 

61546 

75387 

142853 

143020 

143279 

143282 

143421 

143784 

143808 

144886 

145163 

145226 

145368 

145393 

145398 

145432 

145470 

145540 

145556 

145643 

145654 

145681 

145682 

145710 

145714 

145720 

145729 

145753 

145763 

145787 

145827 

145853 

145887 

145898 

145977 

145981 

145983 

146027 

146220 

146835 

146954 

147100 

147162 

147175 

147256 

147377 

147412 

147426 

147516 

147555 

147576 

147780 

147932 

148060 

148776 

148777 

148801 

148802 

148803 

148808 

149410 

149542 

149888 

151184 

151259 

151274 

151395 

151506 

151922 

152788 

153138 

153205 

153214 

153613 

153695 

153809 

153903 

154090 

154385 

154430 

154437 

155086 

155221 

155680 

159812 

159816 

162713 

162899 

162959 

163042 

164621 

164641 

164842 

166398 

169093 

169973 



170996 

171625 

172191 

172895 

178518 

180124 

183002 

183867 

183892 

189234 

190007 

190674 

204460 

205324 

206495 

214739 

215107 

215111 

215546 

215883 

216012 

229684 

238786 

239721 

285977 

286015 

286029 

293688 

295189 

295420 

304097 

304131 

304919 

307094 

307095 

307130 

310124 

313747 

337350 

338263 

338678 

340074 

396145 

396380 

396823 

410689 

413949 

414576 

416295 

417620 

433003 

437924 

438652 

440161 

443500 

449618 

460146 

487425 

494219 

498141 

515247 

525299 

525308 

551606 

559694 

563780 

581088 

581202 

581812 

606334 

609306 

623042 

717146 

732618 

732685 

791053 

805040 

805054 

868008 

882522 

882591 

887615 

887843 

927292 

987087 

1001238 

1001695 

1072405 

1073258 

1086494 

1155066 

1161934 

1163896 

1237213 

1262202 

1305499 

1419769 

1420596 

1421148 

1491704 

1499712 

1590840 

1591172 

1591284 

1591412 

1592204 

1651343 

1651356 

1736739 

1736758 

1742408 

1799827 

1800087 

1827608 

1906757 

1906778 

1916067 

1942155 

1942201 

1942210 

1942994 

1943527 

1943556 

2266668 

2266738 

2281998 

2415721 

2463196 

2546888 

2621990 

2622282 

2622705 

172213 

172488 

172709 

180626 

180677 

182080 

184565 

186513 

187391 

190722 

199993 

200644 

209832 

210812 

214000 

215133 

215160 

215164 

216671 

217063 

217434 

243494 

245804 

247169 

288310 

290549 

292363 

296636 

296668 

298023 

304928 

304930 

304991 

309091 

309108 

309217 

337489 

337934 

338045 

349225 

349834 

392433 

397407 

398026 

402674 

414585 

414887 

415896 

433630 

433687 

434995 

442667 

442955 

443251 

463272 

466772 

473801 

506848 

507852 

509819 

530009 

531395 

533783 

578126 

580330 

581045 

600065 

600619 

606066 

624186 

642167 

683460 

733518 

757826 

757914 

841164 

868006 

868007 

882639 

882640 

886887 

940148 

949838 

974607 

1004217 

1050470 

1050776 

1127200 

1128948 

1143611 

1174158 

1174227 

1217976 

1310704 

1353880 

1419535 

1469147 

1469866 

1480221 

1590963 

1591076 

1591102 

1591748 

1591880 

1592074 

1651573 

1654355 

1655500 

1742409 

1773109 

1799615 

1827808 

1839555 

1857637 

1938363 

1942048 

1942113 

1942494 

1942812 

1942990 

1944495 

2098420 

2209283 

2344871 

2392392 

2411490 

2578816 

2621719 

2621763 

2622740 

2622830 

2623858 



2624505 

2624810 

2648886 

2897751 

2970047 

2981722 

3002951 

3114398 

3123494 

3318884 

3402115 

3434969 

3721880 

3891785 

3891901 

4062530 

4062642 

4062644 

4416541 

4519420 

4519421 

4566497 

4584712 

4589058 

4981177 

4981589 

5019732 

5304875 

5410324 

5531809 

5542521 

5542588 

5640135 

5853049 

6009517 

6015488 

6435581 

6435742 

6449389 

6573631 

6573673 

6682816 

6730014 

6739547 

6850960 

7245543 

7245699 

7466707 

7767190 

8843975 

8896014 

9945006 

9955024 

9955231 

10120933 

10120937 

10120943 

10186007 

10835722 

11079641 

11513561 

11513902 

11513939 

12084111 

12084368 

12084491 

12084765 

12214185 

12620398 

13096505 

13096616 

13096692 

13399750 

13399859 

13399953 

13446666 

13446668 

13621069 

14250534 

14277727 

14277771 

14278518 

14488686 

14719479 

15825931 

15825933 

15825935 

15825954 

15825967 

15825969 

15826199 

15826200 

15826202 

15826537 

15826607 

15988002 

15988379 

15988492 

16974818 

17943113 

17943201 

17943202 

17943375 

18158642 

18158651 

20149818 

20149886 

20150134 

20150562 

20150577 

20150581 

2649709 

2828820 

2897118 

2981783 

2982091 

2982984 

3127934 

3212494 

3257413 

3434984 

3641318 

3688184 

3891916 

3892640 

3980232 

4138422 

4388979 

4416324 

4519424 

4558012 

4558028 

4902922 

4927727 

4980718 

5107542 

5107694 

5107700 

5542048 

5542143 

5542358 

5822223 

5822326 

5822486 

6137465 

6166493 

6435576 

6518406 

6524010 

6573585 

6692578 

6729756 

6729958 

6980697 

7245353 

7245542 

7688345 

7711136 

7767041 

9257125 

9453869 

9651162 

9955257 

10120836 

10120848 

10120945 

10120978 

10120996 

11125386 

11137550 

11385418 

11514025 

11514604 

11514649 

12084602 

12084673 

12084680 

12653431 

12654441 

12803215 

13097744 

13128927 

13162209 

13400003 

13436185 

13446664 

13786942 

13786981 

14043991 

14277813 

14278149 

14278236 

14719481 

14719779 

15077800 

15825936 

15825937 

15825950 

15826050 

15826097 

15826113 

15826229 

15826398 

15826399 

15988023 

15988170 

15988176 

16974931 

16975220 

17942636 

17943203 

17943204 

17943205 

18158779 

18160971 

18655438 

20150355 

20150361 

20150561 

20150702 

20151101 

20151110 

#!/usr/bin/perl -w
use strict;
use DBI;

my $infile = "test.txt";
my $line   = 0;
my $prefix = 'fname';

open (IF, $infile) || die "cannot open \"$infile\": $!";
open (OF2, ">/export/people/dekee/Results/temp.fname");
while ($line = <IF>) {
    chomp ($line);
    print OF2 "$prefix '$line'\;";
    #close (OF2);

    ###################################################################
    # mySQL password must be the argument:
    #   perl file.pl password > output.out
    ###################################################################
    my $dbh1 = DBI->connect( "DBI:mysql:MC:sql", "dekee", $ARGV[0] )
        or die $DBI::errstr;

    ###################################################################
    # Get all Families for the probe sequence
    ###################################################################
    my @FamList = ();
    my $Sth = $dbh1->prepare(
        "SELECT DISTINCT FamId FROM Module WHERE SeqId = ? AND CatId = 3 "
        . "AND MSAGaps IS NOT NULL ORDER BY SeqStart" ) or die $DBI::errstr;
    $Sth->execute( $line ) or die $DBI::errstr;
    while ( my ( $famid ) = $Sth->fetchrow_array ) {
        push @FamList, $famid;
    }
    my $FamList = join ",", @FamList;

    ###################################################################
    # Get all Sequences for all families
    ###################################################################
    my %SeqCount = ();
    $Sth = $dbh1->prepare(
        "SELECT DISTINCT SeqId FROM Module WHERE FamId = ? AND CatId = 3"
    ) or die $DBI::errstr;
    foreach my $famid ( @FamList ) {
        $Sth->execute( $famid ) or die $DBI::errstr;
        while ( my ( $id ) = $Sth->fetchrow_array ) {
            $SeqCount{ $id }++;
        }
    }

    ###################################################################
    # Remove seqs w/o exact # of families as the target sequence
    ###################################################################
    my %SeqStart = ();
    my %SeqPad   = ();
    my $SSPad    = 0;
    my %Seq      = ();
    $Sth = $dbh1->prepare(
        "SELECT Sequence FROM AASequence WHERE Id = ?" ) or die $DBI::errstr;
    while ( my ( $id, $count ) = each %SeqCount ) {
        if ( $count == $#FamList + 1 ) {
            $Sth->execute( $id ) or die $DBI::errstr;
            ( $Seq{ $id } ) = $Sth->fetchrow_array;
            $SeqStart{ $id } = 0;
            $SeqPad{ $id } = 0;
        } else {
            delete $SeqCount{ $id };
        }
    }

    ###################################################################
    # Prepare output for Darwin
    ###################################################################
    #first delete previously made tree
    system( "rm /export/people/dekee/Results/temp.db.tree" );
    open (OF, ">/export/people/dekee/Results/temp.db");
    while ( my ( $id, $seq ) = each %Seq ) {
        print OF "<E>\n<ID>$id</ID>\n<SEQ>$seq</SEQ>\n</E>\n";
    }
    close OF;
    system( "StartDarwin" );
} # end of initial while loop

close (OF2);
close (IF);


Figure  A-2.  PrepareSeqIds.pl 
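The filtering step of PrepareSeqIds.pl keeps only those sequences that occur in every
family of the probe sequence. A minimal Python sketch of that step follows. It is an
illustration under stated assumptions: the database queries are replaced by an in-memory
dictionary, and the function name and the example identifiers are hypothetical.

```python
from collections import Counter


def seqs_in_all_families(family_members):
    """Return ids of sequences that occur in every family of the probe.

    family_members maps each family id of the probe sequence to the
    sequence ids in that family (a stand-in for the Module table queries).
    """
    counts = Counter()
    for members in family_members.values():
        counts.update(set(members))  # count each sequence once per family
    n_families = len(family_members)
    return sorted(s for s, c in counts.items() if c == n_families)


# Hypothetical probe with two families; only sequences 2 and 3 are in both.
example = {10: [1, 2, 3], 11: [2, 3, 4]}
kept = seqs_in_all_families(example)
print(kept)
```

Counting family memberships and comparing the count with the number of families mirrors
the test `$count == $#FamList + 1` in the script.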




MakeMAOutput := proc(DB: anything, fn: string)
global DB, AllAll, Dist, Var, Names, fn;

ReadLibrary('MAlignment');
ReadLibrary('GapHeuristics');
CreateDayMatrices();
ReadLibrary('ProbModel');
Set(printgc=false);

ne := DB[TotEntries];
AllAll := CreateArray(1..ne, 1..ne);
Dist := CreateArray(1..ne, 1..ne);
Var := CreateArray(1..ne, 1..ne);
ok := false;
for i to ne do
    for j from i+1 to ne do
        m := Match(Entry(i,j));
        m := LocalAlignBestPam(m);
        if m[Sim] < 80 then m[PamNumber] := 500; m[PamVariance] := 10^8
        else ok := true fi;
        AllAll[i,j] := AllAll[j,i] := m;
        Dist[i,j] := Dist[j,i] := m[PamNumber];
        Var[i,j] := Var[j,i] := m[PamVariance]
    od
od;

ofile := '/export/people/dekee/Results/'.fn;
OpenAppending(ofile);

Names := CreateArray(1..ne):
for i to ne do Names[i] := DB[Entry,i] od;
Names := CrossReference(Names):

#finding probe sequence (fn) in DB
probeIdx := 0;
for x from 1 to DB[TotEntries] do
    if (fn = Entry(x)[ID]) then probeIdx := x:
    fi:
od:

# finding sequence 'name' in Names (e.g., a, b, c, ...)
probeID := Names[probeIdx]:

#for x from 1 to length(Names) do
#print(Names[x]);
#od;

tree := MinSquareTree(Dist, Var, Names);
tt := ConvertTree(tree):
Ptt := ProbTree(tt):

printf('\n\n\nMultiple Alignment:\n\n');
MultiplePrint(Ptt, [Infix(tree)], probeID);

OpenWriting(terminal);
return(AllAll);
end:

# MultiplePrint ( Texts : array (string) , Names : array (string) ( idx:string  ) 

^ the  result  of  a multiple  alignment  (labeled  by  Names) 

# Gaston  H.  Gonnet  (Feb  15,  1991) 

MultiplePrint  :=proc(  Ptt: [array (array) , array (string) ] , 

Names : array ( string) , idx:string  ) 

PV  :=  Ptt [1 ] ; Texts  :=  Ptt  [2]; 
n :=  length (Texts) ; 

It  :=  length (Texts [1] ) ; 
probelD  :=  idx: 

if  n <>  length (Names)  then  error (' length  mismatch')  fi; 

#f ind  probelD  in  Names  array  (b/c  Infix  [see  fxn  call]  rearranges) 
probeldx  : = 0 : 

for  x from  1 to  length (Names)  do 

if  (probelD  = Names [x] ) then  probeldx  :=  x: 
fi: 
od: 

# Compute  the  bottom  line 
botlin  :=  CreateString (It) ; 
for  i to  It  do 

for  j from  2 to  n while  Texts [j,i]  = Texts [1 , i]  do  od; 
if  j>n  then  botlin [i]  := 

elif  MostProbAA(PV[i] /sum(PV[i] ) ) <>  then  botlin [i]  :=  fi 
od; 

if  assigned (MTitle)  then  printf ( 'Multiple  alignment  for  %s\n\n', 

MTitle)  fi; 
lprint (date ( ) ) ; 

width  :=  Set (screenwidth=80 ) ; Set (screenwidth=width) ; 
width  :=  width- 6; 
maxwidth  : = 0 ; 
lines  :=  3; 

for  i to  n do  maxwidth  :=  max (maxwidth, It)  od; 

for  origoffs  by  width  to  maxwidth  do 

Printf ( ' \n  %d  .. %d\n ', origof fs, origof fs+width-1) ; 
for  j to  n do 

printf ( ’%2. 2s  - %s\n',  Names[j], 

Texts [ j,  origoffs  ..  min (origof fs+width-1, It) ] ) 
od; 

printf ( ' %s\n’,  botlin[  origof fs . .min (origof fs+width-1, It)  ] ); 

# after  printing  bottom  line,  re-print  sequence  indicated  by  probeldx 

# printf ( '%2. 2s  - %s\n'.  Names [probeldx] , 

# Texts [probeldx,  origoffs  ..  min (origof fs+width-1 , It) ] ) ; 
lines  :=  lines  + n + 2; 

# if  lines  >=  61-n  and  n <=  61  then  printf (' ~L\n ') ; lines  :=  0 fi 
od; 

printf ( ' \n \n' ) ; 

NULL 

end: 


Figure  A-3.  darwin.ma 
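
The pairwise loop in MakeMAOutput fills symmetric distance and variance matrices and replaces poorly scoring pairs with a large, highly uncertain distance before the tree is built. The Python sketch below illustrates that step only. It is not Darwin code: the align callback stands in for the Match/LocalAlignBestPam calls, and the parameter names are hypothetical; the cutoff of 80, the fallback PAM distance of 500 and the variance of 10^8 are the values in the listing.

```python
def build_distance_matrices(n, align, min_sim=80.0, far_pam=500.0, far_var=1e8):
    """Fill symmetric PAM distance and variance matrices for n sequences.

    align(i, j) is a hypothetical callback returning (similarity, pam, variance)
    for the pair (i, j). Pairs scoring below min_sim are treated as effectively
    unrelated: a large distance with a very large variance, so that a
    least-squares tree fit gives them almost no weight.
    """
    dist = [[0.0] * n for _ in range(n)]
    var = [[0.0] * n for _ in range(n)]
    any_ok = False
    for i in range(n):
        for j in range(i + 1, n):
            sim, pam, pam_var = align(i, j)
            if sim < min_sim:
                pam, pam_var = far_pam, far_var
            else:
                any_ok = True
            dist[i][j] = dist[j][i] = pam
            var[i][j] = var[j][i] = pam_var
    return dist, var, any_ok


if __name__ == "__main__":
    # Made-up alignment results for three sequences, for illustration only.
    scores = {(0, 1): (120.0, 35.0, 12.0),
              (0, 2): (40.0, 210.0, 900.0),
              (1, 2): (95.0, 60.0, 20.0)}
    dist, var, any_ok = build_distance_matrices(3, lambda i, j: scores[(i, j)])
    for row in dist:
        print(row)
```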




#!/usr/bin/perl -w

use DBI;
use DP_dekee;

&DP_param("TYPE" => "SEMIGLOBAL");

# Surface Areas for the amino acids: www.imb-jena.de/IMAGE_AA.html

my %sa = ();
$sa{"A"} = 115; $sa{"R"} = 225; $sa{"D"} = 150; $sa{"N"} = 160;
$sa{"C"} = 135; $sa{"E"} = 190; $sa{"Q"} = 180; $sa{"G"} = 75;
$sa{"H"} = 195; $sa{"I"} = 175; $sa{"L"} = 170; $sa{"K"} = 200;
$sa{"M"} = 185; $sa{"F"} = 210; $sa{"P"} = 145; $sa{"S"} = 115;
$sa{"T"} = 140; $sa{"W"} = 255; $sa{"Y"} = 230; $sa{"V"} = 155;

my $infile = "test.txt";
my $seqid = 0;

open (TF, $infile) || die "cannot open \"$infile\": $!";
while ($seqid = <TF>) {
    chomp ($seqid);
    # print "Current SeqId = $seqid\n";

    open (OF, ">/export/people/dekee/Results/$seqid.dssp");
    open (OF2, ">/export/people/dekee/Results/$seqid.info");

    #######################################################################
    #mySQL password is the arguments perl file.pl password > output out
    #######################################################################
    my $dbh1=DBI->connect("DBI:mysql:MC:sql", "dekee", $ARGV[0]) or die $!;

    #######################################################################
    # Get all Families for the probe sequence
    #######################################################################
    my $Probe = 0;
    my %Seq = ();
    my $Sth = $dbh1->prepare(
        "SELECT Sequence FROM AASequence WHERE Id = ?") or die $!;
    $Sth->execute( $seqid ) or die $!;
    while (my ( $Probe ) = $Sth->fetchrow_array ) {
        print OF2 "Probe:\n$Probe\n";
        $Seq{ $Probe } = 0;
    }

    #######################################################################
    # Get the PDB Name
    #######################################################################
    my $PDB = "";
    #$PDBName = "";
    $Sth = $dbh1->prepare(
        "SELECT DISTINCT Description FROM SeqAnnotation WHERE SeqId = ? AND "
        . "Description Is NOT NULL and Type between 30001 and 30006") or die $!;
    $Sth->execute( $seqid ) or die $!;
    $PDB = $Sth->fetchrow_array;
    #$PDB =~ s/\[//g; $PDB =~ s/\]//g;
    #$PDBName = lc substr $PDB, 0, 4;
    ( $PDB ) = ( $PDB =~ /\s*\[(\S{4}).*\]/ );
    $PDB = lc $PDB;

    print OF2 "PDB Name = $PDB\n";

    open (IF, "zcat /mirror/dssp/$PDB.dssp.Z |") or die $!;
    #$line = <IF> until $line =~ /^  #/;
    $line = <IF>;
    while ($line !~ /^  #/) { $line = <IF> }

    my @PDBseq = ();
    my @PDBstr = ();
    my @PDBasa = ();
    my @PDBsis = ();
    my $pat = "SYNTHETIC";

    while ($line = <IF>) {
        chomp $line;
        # if ($line =~ /$pat/) { print OF "This is a synthetic protein.\n" }
        my $aa = substr $line, 13, 1;
        next if "iXxbjouz" =~ /$aa/; #deals with non-standard AA's
        push (@PDBseq, $aa);

        # reduce the DSSP states to three: H (helix), E (strand), C (coil)
        my $str = substr $line, 16, 1;
        if ($str eq 'T') {$str = 'C'}
        if ($str eq 'S') {$str = 'C'}
        if ($str eq ' ') {$str = 'C'}
        if ($str eq 'G') {$str = 'H'}
        if ($str eq 'B') {$str = 'E'}
        push (@PDBstr, $str);

        # accessibility relative to the residue's surface area, in tenths
        my $acc = substr $line, 35, 3;
        my $accnorm = int(10*($acc / $sa{$aa}));
        if ($accnorm > 7) {$ASA="S";} elsif ($accnorm > 4) {$ASA="s";}
        elsif ($accnorm > 1) {$ASA="i";} else {$ASA="I";}
        push (@PDBsis, $ASA);
        if ($accnorm >= 10) {$accnorm = 9}
        push (@PDBasa, $accnorm);
    }

    print OF "$_" foreach @PDBseq; print OF "\n";
    #print OF2 foreach @PDBseq; print OF2 "\n";

    my $str2 = join "", @PDBseq;
    my $str1 = &GetMSA1($seqid);
    my $mod1 = &DP_darwin($str1, $str2);

    # print OF "\nstr1[$str1]\nstr2[$str2]\n";
    # print OF "$mod1->{SCORE}\n";
    # print OF "$mod1->{SEQ1}\n";
    # print OF "$mod1->{MID}\n";
    print OF "$mod1->{SEQ2}\n";

    #print OF2 foreach @PDBstr; print OF2 "\n";
    my $structure = join "", @PDBstr;
    #print OF2 "$_" foreach @PDBasa; print OF2 "\n";
    my $surface = join "", @PDBasa;
    #print OF2 "$_" foreach @PDBsis; print OF2 "\n";
    my $ssandis = join "", @PDBsis;

    # this next chunk will adjust (insert appropriate gapping into) the
    # structure string, asa string and the sis string
    for (0 .. length ($mod1->{SEQ2}) - 1) {
        substr($structure, $_, 0, "-") if substr($mod1->{SEQ2}, $_, 1) eq "-";
        substr($surface, $_, 0, "-") if substr($mod1->{SEQ2}, $_, 1) eq "-";
        substr($ssandis, $_, 0, "-") if substr($mod1->{SEQ2}, $_, 1) eq "-";
    }

    print OF "$structure\n";
    print OF "$surface\n";
    print OF "$ssandis\n";

    close (OF);
    close (IF);
}

close (TF);

sub GetMSA1 { # grabs the probe, gapped seq (string 1) known as $seqid.msa
    my ( $seqid ) = @_;
    my $str1 = "";

    open (STR1, "/export/people/dekee/Results/$seqid.msa") or die $!;
    #read in seq1 (gapped from MSA)
    $str1 = <STR1>;
    chomp $str1;
    close (STR1);
    return $str1;
}

Figure  A-4.  GetDSSP.pl 
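
The three per-residue steps in GetDSSP.pl (reducing the DSSP state to H/E/C, binning the relative accessibility, and re-inserting alignment gaps into the annotation strings) can be summarised in a short Python sketch. It is an illustration under assumptions rather than a port of the script: the function names are hypothetical, the surface areas, state mapping and thresholds are copied from the listing, and the gap character is assumed to be "-".

```python
# Reference surface areas per residue, as tabulated in the Perl listing.
MAX_AREA = {
    "A": 115, "R": 225, "D": 150, "N": 160, "C": 135, "E": 190, "Q": 180,
    "G": 75, "H": 195, "I": 175, "L": 170, "K": 200, "M": 185, "F": 210,
    "P": 145, "S": 115, "T": 140, "W": 255, "Y": 230, "V": 155,
}

# DSSP codes that the listing folds into another state; others pass through.
REDUCE = {"T": "C", "S": "C", " ": "C", "G": "H", "B": "E"}


def reduce_state(dssp_code):
    """Map a one-letter DSSP state to the reduced alphabet used in the listing."""
    return REDUCE.get(dssp_code, dssp_code)


def accessibility(acc, residue):
    """Return (digit, label) for an absolute accessibility value.

    The digit is the relative accessibility in tenths, capped at 9;
    the label uses the same thresholds (>7, >4, >1) as the listing.
    """
    tenths = int(10 * acc / MAX_AREA[residue])
    if tenths > 7:
        label = "S"
    elif tenths > 4:
        label = "s"
    elif tenths > 1:
        label = "i"
    else:
        label = "I"
    return min(tenths, 9), label


def insert_gaps(aligned_seq, annotation):
    """Copy the gaps of aligned_seq into an ungapped per-residue annotation."""
    residues = iter(annotation)
    return "".join("-" if c == "-" else next(residues, "") for c in aligned_seq)


if __name__ == "__main__":
    print(reduce_state("G"), accessibility(100, "A"), insert_gaps("AC-D", "HHE"))
```

Walking the aligned sequence once and consuming annotation characters only at non-gap positions gives the same result as inserting a gap at each gapped index in turn, without repeatedly splicing the string.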




#!/usr/bin/perl -w
use DBI;
use MSA;
use DSSP;
use MySQL;

open (IF, "test.txt") or die $!;
while ($seqid = <IF>) {
    chomp $seqid;

    my ( $gi ) = &MySQLrip ( $seqid );
    my ( $pam, $maxpam ) = &MSApam ( $seqid );
    my ( $letter, $aligned, $lengthseq ) = &MSArip ( $seqid );
    my ( $PDB, $sequence, $structure, $surface, $ssandis ) = &DSSPrip ( $seqid );
    my ( $CRYSEQ, $STRSEQ, $SURSEQ, $SISSEQ ) = &DSSPgap ( $sequence, $structure,
        $surface, $ssandis, $aligned, $lengthseq );

    #print "$_\n" foreach @$CRYSEQ;

    my $ln = 0;
    my $line = 0;
    my %seq = ();
    my $SEQpreface
    my $STRpreface = "\$ - ";
    my $SURpreface
    my $SISpreface = "@ - ";

    ## first step: just grab the $seq{$1} .= $2 in order to print @$CRYSEQ
    open (MSA, "$seqid") or die $!;
    while ($ln = <MSA>) {
        chomp $ln;
        do { $ln = <MSA>; } until ( $ln =~ /^Multiple/ );
        $ln = <MSA>; $ln = <MSA>;
        while ( ( $ln = <MSA> ) !~ /^$|^\^L$/ ) {
            my ( $key, $frag ) = ( $ln =~ /^\s*(\S+)\s*-\s(.{1,74})$/ ); #sequence line
            next unless defined $key;
            next unless $key eq $letter;
            $seq{ $key } .= $frag;
        }
    }
    close ( MSA );

    my $crystal = join "", @$CRYSEQ;
    my $struct = join "", @$STRSEQ;
    my $surf = join "", @$SURSEQ;
    my $ssis = join "", @$SISSEQ;

    my ( $pad ) = ( $seq{ $letter } =~ /^(\s*)/ );

    $crystal = $pad . $crystal;
    $struct = $pad . $struct;
    $surf = $pad . $surf;
    $ssis = $pad . $ssis;

    my @CRYSEQ = $crystal =~ /(.{1,74})/g;
    my @STRSEQ = $struct =~ /(.{1,74})/g;
    my @SURSEQ = $surf =~ /(.{1,74})/g;
    my @SISSEQ = $ssis =~ /(.{1,74})/g;

    ## second step: read thru MSA, print appropriate lines and footer
    open (OF, ">/export/people/dekee/Results/$seqid.msa");
    print OF "Multiple Sequence Alignment: $gi [SeqId: $seqid PDB: $PDB MaxPAM: $maxpam]\n";

    open (MSA, "$seqid") or die $!;
    while ($line = <MSA>) {                                   #read in MSA
        chomp $line;
        do { $line = <MSA>; } until ( $line =~ /^Multiple/ );
        $line = <MSA>; $line = <MSA>;
        while ( ( $line = <MSA> ) !~ /^$|^\^L$/ ) {
            next unless $line =~ /\S/;                        #Skip blank lines
            if ( $line =~ /^\s+(\d+)\s+\.\.\d+/ ) {            #header nn ..nn
                print OF "\n$line";
            } elsif ( $line =~ /^\s*(\S+)\s*-\s(.{1,74})$/ ) { #sequence
                print OF $line;
            } else { print OF "$line\n";                       #footer
                print OF $SEQpreface; my $x1 = shift @CRYSEQ; print OF "$x1\n";
                print OF $STRpreface; my $x2 = shift @STRSEQ; print OF "$x2\n";
                print OF $SURpreface; my $x3 = shift @SURSEQ; print OF "$x3\n";
                print OF $SISpreface; my $x4 = shift @SISSEQ; print OF "$x4\n";
            }
        }
    }
    close (OF);
    close (MSA);
}

close ( IF );

Figure A-5. Master.pl
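
Master.pl pads each annotation string so that it lines up with the probe sequence and then cuts it into pieces of at most 74 characters, one piece per printed alignment block. A minimal Python sketch of that wrapping step follows. The function name wrap_blocks and the sample strings are hypothetical; the block width of 74 and the `.{1,74}` pattern are taken from the listing.

```python
import re


def wrap_blocks(text, width=74):
    """Split text into consecutive pieces of at most `width` characters.

    Mirrors the Perl idiom `$text =~ /(.{1,74})/g`; like that idiom,
    it assumes the text contains no newline characters.
    """
    return re.findall(r".{1,%d}" % width, text)


if __name__ == "__main__":
    pad = "   "                 # leading whitespace copied from the probe line
    annotation = "H" * 100      # made-up annotation string
    for block in wrap_blocks(pad + annotation):
        print(len(block), block)
```

Because every annotation string receives the same padding before it is wrapped, block k of each annotation sits under block k of the printed alignment.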


APPENDIX  B 
MOTIF  DESCRIPTION 

The data displayed for each helix (Tables B-1, B-3, and B-6) include
the helix number (assigned sequentially starting with 1 at the N-terminus of the protein),
the residue numbers corresponding to the start and end of the helix, and the helix type:
H (α-helix) or G (3₁₀ helix). Information about the geometry of the helix is given as
follows: length (in Angstroms), the number of residues per turn (ideally 3.6 for alpha
helices), and a measure of the deviation of the helix geometry from an ideal helix (in
degrees). This latter value should be 0 for a perfect helix. These parameters are not
calculated for helices with fewer than four residues. The final column in the table gives the
amino acid sequence of each helix.
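
As a rough plausibility check on the tabulated helix lengths, the length expected for an ideal α-helix can be estimated from the residue count. The Python sketch below assumes a rise of 1.5 Å per residue (a 5.4 Å pitch at 3.6 residues per turn); the function name is hypothetical, and the way the tabulated lengths were actually measured is not stated here, so small differences are expected. The start, end and length values are those of the H-type helices in Table B-1.

```python
RISE_PER_RESIDUE = 1.5  # Å per residue for an ideal alpha-helix (assumed value)


def ideal_helix_length(start, end):
    """Length in Å of an ideal alpha-helix spanning residues start..end inclusive."""
    return (end - start + 1) * RISE_PER_RESIDUE


# (start, end, tabulated length in Å) for the H-type helices of Table B-1
TABLE_B1 = [
    (5, 14, 15.53), (21, 33, 19.76), (41, 49, 13.23), (60, 74, 23.00),
    (92, 113, 32.99), (122, 134, 18.75), (146, 170, 36.80),
]

if __name__ == "__main__":
    for start, end, tabulated in TABLE_B1:
        ideal = ideal_helix_length(start, end)
        print(f"{start:>3}-{end:<3}  ideal {ideal:5.1f}  tabulated {tabulated:5.2f}")
```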

The second table for each protein (Tables B-2, B-4, and B-7) gives information about
interacting pairs of helices in the protein. Pairs are included in the table if one or both of
the helices are in the current polypeptide chain. The interaction type describes where in
each of the helices the distance of closest approach occurs (C: beyond the C-terminus of
the helix, N: beyond the N-terminus of the helix, I: internal to the helix). The number of
interacting pairs of residues and the number of residues in each of the two helices
involved in the interaction are also given. The final column indicates whether the
interaction is inter- or intrachain.

For each beta strand in the protein chain, a table (Tables B-5 and B-8) gives the
strand number (assigned sequentially from the N-terminus of the protein), the start and
end residues, whether the strand is at the edge of a beta sheet or not, and the amino acid
sequence for each strand.


Table B-1. Individual helices for T0129, PDB id: 1izm

Helix                       Length  Residues  Deviation
Number  Start  End   Type   (Å)     per turn  (degrees)  Sequence
1       5      14    H      15.53   3.64      3.8        HSD-NQQLKS
2       21     33    H      19.76   3.67      6.8        ATELHGFLSGLLC
3       41     49    H      13.23   3.77      11.6       WLPLLYQFS
4       60     74    H      23.00   3.66      6.4        VQPVTELYEQISQTL
5       92     113   H      32.99   3.61      14.9       VFTQADSLSDWANQFLLGIGLA
6       117    119   G      -       -         -          LAK
7       122    134   H      18.75   3.62      7.8        GEIGEAVDDLQDI
8       146    170   H      36.80   3.62      14.6       EELAEALEEIIEYVRTIA-LFYSHF


Table B-2. Helix interactions involving 1izm

Helix    Helix  Interaction  Number of Residues
Numbers  Types  Type         Total  Helix 1  Helix 2  Interaction
1  2     H H    I N          7      4        3        Intrachain
1  4     H H    I N          4      2        3        Intrachain
2  3     H H    I I          8      4        4        Intrachain
2  4     H H    I I          14     6        6        Intrachain
2  5     H H    I I          25     9        10       Intrachain
2  8     H H    C C          5      2        4        Intrachain
3  4     H H    I I          5      3        4        Intrachain
3  5     H H    I I          3      2        2        Intrachain
4  8     H H    C I          3      2        2        Intrachain
5  7     H H    I C          8      5        4        Intrachain
5  8     H H    N N          15     8        9        Intrachain
7  8     H H    I I          15     8        8        Intrachain


Table B-3. Individual helices for T0148, PDB id: 1in0

Helix                       Length  Residues  Deviation
Number  Start  End   Type   (Å)     per turn  (degrees)  Sequence
1       12     26    H      22.58   3.60      9.7        LHEVRNAVENANRVL
2       30     32    G      -       -         -          YDF
3       55     71    H      26.10   3.54      8.3        DFQLEQLIEILIGSCIK
4       76     78    G      -       -         -          HSS
5       104    117   H      21.36   3.61      7.3        TEMAKKITKLVKDS
6       137    149   H      20.06   3.56      4.6        RDDLQAVIQLVKS


Table B-4. Helix interactions involving 1in0

Helix    Helix  Interaction  Number of Residues
Numbers  Types  Type         Total  Helix 1  Helix 2  Interaction
1  3     H H    I I          10     5        5        Intrachain
3  9     H H    I I          7      5        4        Interchain
5  6     H H    I I          6      4        3        Intrachain




Table B-5. Individual strands for T0148, PDB id: 1in0

Strand
Number  Start  End   Edge  Sequence
1       3      7     No    SFDIV
2       37     42    Yes   AVIELN
3       47     52    No    TIKITT
4       79     80    Yes   LD
5       86     88    Yes   EHH
6       91     98    No    LYSKEIKL
7       122    126   Yes   QTQIQ
8       129    133   No    QVRVT
9       157    162   Yes   QFNNFR

Table B-6. Individual helices for T0149, PDB id: 1inj

Helix                       Length  Residues  Deviation
Number  Start  End   Type   (Å)     per turn  (degrees)  Sequence
1       37     46    H      15.20   3.56      8.7        ILEHSVHALL
2       65     67    G      -       -         -          FAQ
3       69     72    G      7.10    3.30      41.9       PLAN
4       85     94    H      16.30   3.58      4.3        RADSVLAGLK
5       114    121   H      12.49   3.67      12.7       QDDLARLL
6       122    125   G      7.58    3.44      40.7       ALSE
7       171    183   H      20.01   3.66      5.4        RELLHDCLTRALN
8       191    197   H      11.68   3.52      10.8       EASALEY
9       217    227   H      17.84   3.53      11.2       PEDLALAEFYL


Table B-7. Helix interactions involving 1inj

Helix    Helix  Interaction  Number of Residues
Numbers  Types  Type         Total  Helix 1  Helix 2  Interaction
1  3     H G    N N          5      4        2        Intrachain
1  5     H H    C I          3      2        2        Intrachain
1  9     H H    N I          2      2        2        Intrachain
4  7     H H    I I          11     5        6        Intrachain
4  8     H H    I N          2      2        1        Intrachain
5  6     H G    I I          8      4        3        Intrachain
7  8     H H    I I          10     6        4        Intrachain

Table B-8. Individual strands for T0149, PDB id: 1inj

Strand
Number  Start  End   Edge  Sequence
1       8      14    No    VCAWPA
2       51     58    No    VKRWIAI
3       76     80    Yes   ITWD
4       101    104   No    WVLV
5       131    136   No    GILAAP
6       141    144   Yes   MKRA
7       151    155   Yes   IAHTV
8       161    170   No    WHALTPQFFP
9       203    206   Yes   QLVE

REFERENCES 


Alber,  T.  In  Prediction  of  Protein  Structure  and  the  Principles  of  Protein  Conformation • 
Fasman,  G.,  Ed.;  Plenum:  New  York,  1989. 

Anfinsen,  C.  B.;  Haber,  E.;  Sela,  M.;  White,  F.  H.  The  Kinetics  of  Formation  of  Native 
Ribonuclease  During  Oxidation  of  the  Reduced  Polypeptide  Chain.  Proc.  Natl  Acad  Sci 
USA  1961, 47,  1309-1314.  ' ’ ' 

Baker,  D;  Agard,  D.  A.  Kinetics  versus  Thermodynamics  in  Protein  Folding 
Biochemistry  1994,  33,  7505-7509. 

Benner,  S.  A.  Patterns  of  Divergence  in  Homologous  Proteins  as  Indicators  of  Tertiary 
and  Quaternary  Structure.  Adv.  Enzyme  Regul.  1989,  28,  219-236. 

Benner,  S.  A.;  Badcoe,  I.;  Cohen,  M.  A.;  Gerloff,  D.  L.  Bona  Fide  Prediction  of  Aspects 
ot  Protein  Conformation.  J.  Mol.  Biol.  1994,  235,  926-958. 

Benner,  S.  A.;  Cannarozzi,  G.;  Gerloff  D.;  Turcotte,  M.;  Chelvanayagam,  G.  Bona  Fide 
Predictions  of  Protein  Secondary  Structure  Using  Transparent  Analyses  of  Multiple 
Sequence  Alignments.  Chem.  Rev.  1997,  97,  2725-2843. 

Benner,  S.  A.;  Cohen,  M.  A.;  Gonnet,  G.  H.  Empirical  and  structural  models  for 

insertions  and  deletions  in  the  divergent  evolution  of  proteins.  J Mol  Biol  1993  229 
moo  • 


Benner,  S.  A.;  Gerloff,  D.  L.  Patterns  of  Divergence  in  Homologous  Proteins  as 
Indicators  of  Secondary  and  Tertiary  Structure.  The  Catalytic  Domain  of  Protein  Kinases 
Adv.  Enzyme  Regul.  1991,  37,  121-181. 

Benner,  S.  A.;  Gerloff,  D.  L.;  Chelvanayagam,  G.  The  Phospho-b-Galactosidase  and 
Synaptotagmin  Predictions.  Proteins  1995,  23,  446-453. 

Biou,  V.;  Gibrat,  J.F.;  Levin,  J.M.;  Robson,  B.;  Gamier,  J.  Secondary  Structure 
Prediction  - Combination  of  Three  Different  Methods.  Protein  Eng.  1988,  2,  185-191 

Blaber,  M.;  Zhang,  X.-J.;  Matthews,  B.  W.  Structural  Basis  of  Amino  Acid  Alpha  Helix 
Propensity.  Science  1993,  260,  1637-1 640. 

Bohm,  G.;  Jaenicke,  R.  Correlation  Functions  as  a Tool  for  Protein  Modeling  and 
Structure  Analysis.  Protein  Sci.  1992, 1,  1269-1278. 


149 


150 


Bowie,  J.  U.;  Luethy,  R.;  Eisenberg,  D.  A Method  to  Identify  Protein  Sequences  that  Fold 
into  a Known  Three-Dimensional  Structure.  Science  1991,  253,  164-170. 

Chakrabartty,  A.;  Baldwin,  R.  L.  Stability  of  Alpha-Helices.  Adv.  Protein  Chem  1995 
46,  141-176. 

Chothia,  C.  1000  families  for  the  molecular  biologist.  Nature  1992,  357,  543.544. 

Chothia,  C.;  Lesk,  A.  M.  The  Relation  between  the  Divergence  of  Sequence  and 
Structure  in  Proteins.  EMBOJ.  1986,  5,  823-826. 

Chou,  P.  Y.;  Fasman,  G.  D.  Prediction  of  Protein  Conformation.  Biochemistry  1974, 13, 


Chou,  P.  Y.;  Fasman,  G.  D.  Prediction  of  the  Secondary  Structure  of  Proteins  from  their 
Amino  Acid  Sequence.  Adv.  Enzymol.  Relat.  Areas  Mol.  Biol.  1978,  47,  45-148. 

Clark  D.  A ; Shirazi,  J.;  Rawlings,  C.  J.  Protein  Topology  Prediction  through  Constraint- 
Based  Search  and  the  Evaluation  of  Topological  Folding  Rules.  Protein  Eng.  1991,  4 


Cohen,  F.  E.;  Abarbanel,  R.  M.;  Kuntz,  I.  D.;  Fletterick,  R.  J.  Turn  Prediction  in  Proteins 
Using  a Pattern-Matching  Approach.  Biochemistry  1986,  25,  266-75. 

Colloc’h,  N.;  Etchebest,  C.;  Thoreau,  E.;  Henrissat,  B.;  Momon,  J.  P.  Comparison  of 
Three  Algorithms  for  the  Assignment  of  Secondary  Structure  in  Proteins:  The 
Advantages  of  a Consensus  Assignment.  Protein  Eng.  1993,  6,  377-382. 

Crawford,  I.  P.;  Niermann,  T.;  Kirschner,  K.  Prediction  of  Secondary  Structure  by 

Evolutionary  Comparison:  Application  to  the  Alpha  Subunit  of  Tryptophan  Synthase 
Proteins  1987, 2,  118-129.  P ayninase. 

Dayhoff,  M.  O.;  Schwartz,  R.  M.;  Orcutt,  B.  C.  In  Atlas  of  Protein  Sequence  and 
DcTh)78^  ^353  362  " Nati°nal  Biomedical  Research  Foundation:  Washington, 

Dodge,  R.  W.;  Laity,  J.  H.;  Rothwarf,  D.  M.;  Shimotakahara,  S.;  Scheraga,  H A Folding 
Pathway  of  Guanidine-Denatured  Disulfide-Intact  Wild-Type  and  Mutant  Bovine 
Pancreatic  Ribonuclease  A.  J.  Protein  Chem.  1994, 13,  409-421. 

Doolittle,  R.  F.,  Ed.;  Molecular  Evolution,  Computer  Analysis  of  Protein  and  Nucleic 
Acid  Sequences',  Academic  Press:  New  York,  1990. 

Dunbrack,  R.  L.;  Gerloff,  D.  L.;  Bower,  M.;  Chen,  X.  W.;  Lichtarge,  O.;  Cohen,  F.  E. 
Meeting  Review:  The  Second  Meeting  on  the  Critical  Assessment  of  Techniques  for 
Protein  Structure  Prediction  (CASP2).  Folding  Des.  1997,  2,  R27-R42. 


151 


Eisenberg,  D.;  Wesson,  M.;  Wilcox,  W.  In  Prediction  of  Protein  Structure  and  the 
Principles  of  Protein  Conformation ; Fasman,  G.,  Ed.;  Plenum:  New  York,  1989. 

Fasman,  G.  D.  Prediction  of  Protein  Structure  and  the  Principles  of  Protein 
Conformation ; Plenum;  New  York,  NY,  1989. 


Fauchere,  J.  L.;  Charton,  M.;  Kier,  L.  B.;  Verloop,  A.;  Pliska,  V.  Amino  Acid  Side  Chain 
Parameters  for  Correlation  Studies  in  Biology  and  Pharmacology.  Int.  J.  Peptide  Protein 
Res.  1988,  32,  269-278. 


Feng,  D.  F.;  Johnson,  M.  S.;  Doolittle,  R.  F.  Aligning  Amino  Acid  Sequences: 
Comparison  of  Commonly  Used  Methods.  J.  Mol.  Evol.  1985,  21,  1 12-125. 

Fifth  Community  Wide  Experiment  on  the  Critical  Assessment  of  Techniques  for  Protein 
Structure  Prediction.  http://predictioncenter.llnl.gov/casp5/Casp5.html  (accessed  Aug 
2003). 

Fitch,  W.  M.  An  Improved  Method  of  Testing  for  Evolutionary  Homology.  J.  Mol  Biol 
1966, 16,  9-16. 

Fratemali,  F.;  van  Gunsteren,  W.  F.  An  Efficient  Mean  Solvation  Force  Model  for  use  in 
Molecular  Dynamics  Simulations  of  Proteins  in  Aqueous  Solution.  J.  Mol.  Biol  1996 
256,  939-948. 

Gamier,  J.;  Osguthorpe,  D.  J.;  Robson,  B.  Analysis  of  the  Accuracy  and  Implication  of 
Simple  Methods  for  Predicting  the  Secondary  Structure  of  Globular  Proteins  J Mol 
Biol.  1978, 120,  97-120. 

Gerloff,  D.  L.;  Benner,  S.  A.  A Consensus  Prediction  of  the  Secondary  Structure  for  the 
6-P hospho-Beta-D-Galactosidase  Superfamily.  Proteins  1995,  21,  273-281. 


Gibson,  T.  J.;  Postma,  J.  P.;  Brown,  R.  S.;  Argos,  P.  A Model  for  the  Tertiary  Structure  of 
the  28  Residue  DNA-Binding  Motif  ('Zinc  Finger')  Common  to  many  Eukaryotic 
Transcriptional  Regulatory  Proteins.  Protein  Eng.  1988,  2,  209-218. 


Gonnet,  G.  H.;  Cohen,  M.  A.;  Benner,  S.  A.  Exhaustive  Matching  of  the  Entire  Protein 
Sequence  Database.  Science,  1992,  256,  1443-1445. 

Gonnet,  G.  H.;  Hallett,  M.  T.;  Korostensky,  C.;  Bemardin,  L.  Darwin  v.  2.0:  An 
Interpreted  Computer  Language  for  the  Biosciences.  Bioinformatics,  2000,  101-103. 

Guzzo,  A.  V.  The  Influence  of  Amino  Acid  Sequence  on  Protein  Structure.  Biophys  J 
1965,  5,  809-822. 

Hao,  M.  H.;  Scheraga,  H.  A.  How  Optimization  of  Potential  Functions  Affects  Protein 
Folding.  Proc.  Natl.  Acad.  Sci.  USA  1996,  93,  4984-4989. 


152 


Hartl,  D.  U.  Molecular  Chaperones  in  Cellular  Protein  Folding.  Nature  1996  381  571- 
579. 

Holley,  H.  W.;  Karplus,  M.  Protein  Secondary  Structure  Prediction  with  a Neural 
Network.  Proc.  Natl.  Acad.  Sci.  USA  1989,  86,  152-156. 


Horovitz,  A.;  Matthews,  J.  M.;  Fersht,  A.  R.  Alpha-helix  stability  in  proteins.  II.  Factors 
that  Influence  Stability  at  an  Internal  Position.  J.  Mol.  Biol.  1992,  227,  560  568. 

Jones,  D.  T.;  Taylor,  W.  R.;  Thornton,  J.  M.  The  Rapid  Generation  of  Mutation  Data 
Matrices  from  Protein  Sequences.  Comput.  Appl.  Biosci.  1992,  8,  275-82. 

Kabsch,  W.;  Sander,  C.  Dictionary  of  Protein  Secondary  Structure:  Pattern  Recognition 
of  Hydrogen-Bonded  and  Geometrical  Features.  Biopolymers  1983,  22,  2577-2637. 

Kabsch,  W;  Sander,  C.  On  the  Use  of  Sequence  Homologies  to  Predict  Protein  Structure: 
Identical  Pentapeptides  Can  Have  Completely  Different  Conformations.  Proc.  Natl. 
Acad.  Sci.  USA  1984,  81,  1075-1078. 

Kallenbach,  N.  R.;  Lyu,  P.;  Zhou,  H.  In  Circular  Dichroism  and  the  Conformational 
Analysis  of  Biopolymers\  Fasman,  G.  D.;  Ed;  Plenum:  New  York,  NY,  1996. 

Kimura,  M.  Molecular  Evolution,  Protein  Polymorphism,  and  the  Neutral  Theory, 
Springer- Verlag:  Berlin,  1982;  pp  3-56. 


Kinch,  L.  N.;  Qi,  Y.;  Hubbard,  T.  J.  P.;  Grishin,  N.  V.  CASP5  Target  Classification. 
Proteins:  Struct.  Func.  And  Genetics  2003,  53{S6),  340-351. 

King,  J.  L.;  Jukes,  T.  H.;  Non-Darwinian  Evolution.  Science  1969, 164,  788-798. 

Kolinski,  A.;  Skolnick,  Monte  Carlo  Simulations  of  Protein  Folding.  II.  Application  to 
Protein  A,  ROP,  and  Crambin.  J.  Proteins  1994, 18,  353-366. 

Lenstra,  J.  A.;  Hofsteenge,  J;  Beintema,  J.  J.  Invariant  Features  of  the  Structure  of 
Pancreatic  Ribonuclease.  A Test  of  Different  Predictive  Models  J.  Mol.  Biol  1977  109 
185-193. 

Levin,  J.M.;  Robson,  B.;  Gamier,  J.  An  Algorithm  for  Secondary  Structure 
Determination  in  Proteins  Based  on  Sequence  Similarity.  FEBS Lett.  1986,  205,  303-308. 

Levin,  J.M.;  Pascarella,  S.;  Argos,  P.;  Gamier,  J.  Quantification  of  Secondary  Structure 
Prediction  Improvement  using  Multiple  Alignments.  Protein  Eng.  1993,  6,  849-854. 

Levitt,  M.  Accurate  Modelling  of  Protein  Conformation  by  Automatic  Segment 
Matching.  J.  Mol.  Biol.  1992,  226,  507-533. 

Lim,  V.  I.  Algorithms  for  Prediction  of  Alpha  Helices  and  Beta  Structural  Regions  in 
Globular  Proteins.  J.  Mol.  Biol.  1974,  88,  873-894. 


153 


Lodish,  H.;  Baltimore,  D.;  Berk,  A.;  Zipursky,  S.  L.;  Matsudaira,  P.;  Darnell,  J. 
Molecular  Cell  Biology,  Third  Edition;  W.  H.  Freeman  and  Company:  New  York  NY 
1995. 

Mackay,  D.  H.  J.;  Cross,  A.  J.;  Hagler,  A.  T.  In  Prediction  of  Protein  Structure  and  the 
Principles  of  Protein  Conformation ; Fasman,  G.,  Ed.;  Plenum:  New  York,  1989. 

Matthews,  B.  W.;  Nicholson,  H.;  Becktel,  W.  Enhanced  Protein  Thermostability  from 
Site-Directed  Mutations  that  Decrease  the  Entropy  of  Unfolding.  J.  Proc.  Natl.  Acad  Sci 
USA  1987,  84,  6663-6667. 

Maxfield,  F.  R.;  Scheraga,  H.  A.  Improvements  in  the  Prediction  of  Protein  Backbone 
Topography  by  Reduction  of  Statistical  Errors.  Biochemistry  1979, 18,  697-704. 


McCammon,  J.  A.;  Wong,  C.  F.;  Lybrand,  T.  P.  In  Prediction  of  Protein  Structure  and 
the  Principles  of  Protein  Conformation ; Fasman,  G.,  Ed.;  Plenum:  New  York,  1989. 

McLachlan,  A.  D.  Repeating  Sequences  and  Gene  Duplication  in  Proteins.  J.  Mol.  Biol 
1972,  64,  417-37. 

Myers,  J.  K.;  Pace,  C.  N.;  Scholtz,  J.  M.  A Direct  Comparison  of  Helix  Propensity  in 
Proteins  and  Peptides.  Proc.  Natl.  Acad.  Sci.  USA  1997,  94,  2833-2837. 

Needleman,  S.  B.;  Wunsch,  C.  D.  A General  Method  Applicable  to  Search  for 
Similarities  in  the  Amino  Acid  Sequences  of  Two  Proteins.  J.  Mol.  Biol.  1970  48  443- 
453. 

Nishikawa,  K.  Assessment  of  Secondary-Structure  Prediction  of  Proteins  - Comparison 
of  Computerized  Chou-Fasman  Methods  with  Others.  Biochim.  Biophys.  Acta  1983  748 
285-299. 

O'Neil,  K.  T.;  DeGrado,  W.  F.  A Thermodynamic  Scale  for  the  Helix-Forming 
Tendencies  of  the  Commonly  Occurring  Amino  Acids.  Science  1990,  250,  646-651. 

Park,  B.;  Levitt,  M.  Energy  Functions  that  Discriminate  X-ray  and  Near-native  Folds 
from  Well  Constructed  Decoys.  J.  Mol.  Biol.  1996,  258,  367-392. 

Park,  S.-H.;  Shalongo,  W.;  Stellwagen,  E.  Residue  Helix  Parameters  Obtained  from 
Dichroic  Analysis  of  Peptides  of  Defined  Sequence.  Biochemistry  1993,  32,  7048-7053. 

Pascarella,  S.;  Argos,  P.  Analysis  of  Insertions/Deletions  in  Protein  Structures.  J.  Mol 
Biol.  1992,  224,  461-71. 

Pascarella,  S.;  Argos,  P.  Conservation  of  Amphipathic  Conformations  in  Multiple  Protein 
Structural  Alignments.  Protein  Eng.  1994,  7,  185-193. 


154 


Pauling,  L;  Corey,  R.  B.;  Branson,  H.  R.  The  Structures  of  Proteins:  Two  Hydrogen- 
Bonded  Helical  Configurations  of  the  Polypeptide  Chain.  Proc.  Natl.  Acad.  Sci.  USA 
1951,37,205-210. 

Protein  Stuff,  http://www.phys.psu.edu/~lezon/prot4.html  (accessed  February  20,  2004). 

Ramachandran,  G.  N.;  Sasisekharan,  V.  Conformation  of  Polypeptides  and  Proteins.  Adv. 
Protein  Chem.  1968,  23,  283-438. 

Reimer,  U.;  Fuellen,  G.  Biocomputing  in  a Nutshell,  http://www.techfak.uni- 
bielefeld.de/bcd/ForAll/Basics/welcome2.html  (accessed  November  18,  2003). 

Richards,  F.  M.;  Kundrot,  C.  E.  Identification  of  Structural  Motifs  from  Protein 
Coordinate  Data:  Secondary  Structure  and  First-Level  Supersecondary  Structure. 

Proteins  1988,  3,  71-84. 

Richards,  F.M.  The  Protein  Folding  Problem.  Sci.  Am.  1991,  264,  34-41. 

Rohl,  C.  R.;  Chakrabartty,  A.;  Baldwin,  R.  L.  Helix  Propagation  and  N-cap  Propensities 
of  the  Amino  Acids  Measured  in  Alanine-Based  Peptides  in  40  Volume  Percent 
Trifluoroethanol.  Protein  Sci.  1996,  5,  2623-2637. 

Rose,  G.  D.  Prediction  of  Chain  Turns  in  Proteins  on  a Globular  Basis.  Nature  1978  272 
586-590. 


Rose,  G.  D.;  Wetlaufer,  D.  B.  The  Number  of  Turns  in  Globular  Proteins.  Nature  1977 
268,  769-770. 

Rossman,  M.  G.;  Liljas,  A.;  Branden,  C.  I.;  Banaszak,  L.  J.  In  The  Enzymes',  Boyer,  P.  D.; 
Ed;  Academic  Press:  New  York,  NY,  1975;  Vol.  1 1,  pp.  61-102. 

Rost,  B.;  Sander,  C.  Combining  Evolutionary  Information  and  Neural  Networks  to 
Predict  Protein  Secondary  Structure.  Proteins  1994, 19,  55-72. 

Rost,  B.;  Sander,  C.;  Schneider,  R.  Redefining  the  Goals  of  Protein  Secondary  Structure 
Prediction.  J.  Mol.  Biol.  1994,  235,  13-26. 

Scheraga,  H.  A.  Structural  Studies  of  Ribonuclease  III.  A Model  for  the  Secondary  and 
Tertiary  Structure.  J.  Am.  Chem.  Soc.  1960,  82,  3847-3852. 


Schiffer,  C.  A.;  Caldwell,  J.  W.;  Stroud,  R.  M.;  Kollman,  P.  A.  Inclusion  of  Solvation 
Free  Energy  with  Molecular  Mechanics  Energy:  Alanyl  Dipeptide  as  a Test  Case.  Protein 
Sci.  1992, 1,  396-400. 

Schiffer,  M.;  Edmundson,  A.  B.  Use  of  helical  wheels  to  represent  the  structures  of 
proteins  and  to  identify  segments  with  helical  potential.  Biophys.  J.  1967,  7,  121-135. 


155 


Schulz,  G.  E.;  Schirmer,  R.  H.  Principles  of  Protein  Structure ; Springer- Verlag:  New 
York,  NY,  1979. 

Secondary  and  3D  Protein  Structure  Prediction  Evaluation  Info. 
http://predictioncenter.llnl.gov/casp5/pubResultS/CASP_BROWSER/ab- 
measures.html#SS_MEASURES  (accessed  February  20,  2004). 

Segovia,  L.  Protein  Structure  Prediction  on  the  Web.  Nature  Biotech.  1997, 15,  915. 

Shortle,  D.  Protein  Fold  Recognition.  Nat.  Struct.  Biol.  1995,  2,  91-93. 


Simons,  K.  T.;  Kooperberg,  C.;  Huang,  E.;  Baker,  D.  Assembly  of  Protein  Tertiary 
Structures  from  Fragments  with  Similar  Local  Sequences  using  Simulated  Annealing  and 
Bayesian  Scoring  Functions.  J.  Mol.  Biol.  1997,  268,  209-225. 

Sklenar,  H.;  Etchebest,  C.;  Lavery,  R.  Describing  Protein  Structure:  A General  Algorithm 
Yielding Complete Helicoidal Parameters and a Unique Helicoidal Axis. Proteins 1989,
6, 46-60.

Smith, C. K.; Regan, L. Construction and Design of Beta-Sheets. Acc. Chem. Res. 1997,
30, 153-161.

Smith,  T.  F.;  Waterman,  M.  S.  Identification  of  Common  Molecular  Sequences.  J.  Mol. 
Biol.  1981, 147,  195-197. 

Srinivasan,  R.;  Rose,  G.  D.  LINUS:  A Hierarchic  Procedure  to  Predict  the  Fold  of  a 
Protein.  Proteins  1995,  22,  81-99. 

Sternberg,  M.  J.;  Cohen,  F.  E.  The  Prediction  of  the  Secondary  and  Tertiary  Structures  of 
Interferon from Four Homologous Amino Acid Sequences. Int. J. Biol. Macromol. 1982,
4, 137-144.

Stowell,  M.  H.  B.;  Rees,  D.  C.  Structure  and  Stability  of  Membrane  Proteins.  Adv.  Prot. 
Chem. 1995, 46, 279-311.

Stryer, L. Biochemistry, 4th Edition; W.H. Freeman and Company: New York, NY,
1995.

Taylor,  W.  R.  Protein  Fold  Refinement:  Building  Models  from  Idealized  Folds  using 
Motif  Constraints  and  Multiple  Sequence  Data.  Protein  Eng.  1993,  6,  593-604. 

Wako,  H.;  Blundell,  T.  L.  Use  of  Amino  Acid  Environment-Dependent  Substitution 
Tables  and  Conformational  Propensities  in  Structure  Prediction  from  Aligned  Sequences 
of Homologous Proteins. I. Solvent Accessibility Classes. J. Mol. Biol. 1994a, 238,
682-692.




Wako,  H.;  Blundell,  T.  L.  Use  of  Amino  Acid  Environment-Dependent  Substitution 
Tables  and  Conformational  Propensities  in  Structure  Prediction  from  Aligned  Sequences 
of  Homologous  Proteins.  II.  Secondary  Structures.  J.  Mol.  Biol.  1994b,  238,  693-708. 

Wilmot,  C.  M.;  Thornton,  J.  M.  Analysis  and  Prediction  of  the  Different  Types  of  Beta- 
Turn  in  Proteins.  J.  Mol.  Biol.  1988,  203,  221-232. 

Wojcik,  J.;  Altman,  K.  H.;  Scheraga,  H.  A.  Helix  Coil  Stability-Constraints  for  the 
Naturally-Occurring  Amino-Acids  in  Water.  Biopolymers  1990,  30,  121-134. 

Zuckerkandl,  E.  Evolution  of  Hemoglobin.  Sci.  Am.  1965,  212,  110-118. 


Zvelebil,  M.  J.;  Barton,  G.  J.;  Taylor,  W.  R.;  Sternberg,  M.  J.  Prediction  of  protein 
secondary structure and active sites using the alignment of homologous sequences. J.
Mol. Biol. 1987, 195, 957-961.


BIOGRAPHICAL  SKETCH 


Danny W. De Kee was born on May 11, 1973, in Ottawa, Canada, to Daniel and
Sonja De Kee. He graduated from Bishop’s College School, Lennoxville, Canada, in
May 1991. He received a bachelor’s degree in chemistry from the University of Ottawa
in 1996. He started his graduate training in 1996 under Professor Mike Zerner in the area
of quantum chemistry. After Professor Zerner’s untimely death, Danny transferred to
Professor Steven Benner’s group to pursue a degree in the area of bioinformatics.




I certify that I have read this study and that in my opinion it conforms to acceptable
standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a 
dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Steven  A.  Benner,  Chair 
Distinguished  Professor  of  Chemistry 


I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a
dissertation for the degree of Doctor of Philosophy.


Associate  Professor  of  Chemistry 


I certify that I have read this study and that in my opinion it conforms to acceptable
standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a 
dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Nicole A. Horenstein
Associate  Professor  of  Chemistry 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable 
standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a 
dissertation  for  the  degree  of  Doctor  of  Philosophy. 


N.  Yngve 
Professor 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable 
standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a 
dissertation  for  the  degree  of  Doctor  of  Philosophy. 


John R. Sabin
Professor  of  Physics 


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  Department  of 
Chemistry  in  the  College  of  Liberal  Arts  and  Sciences  and  to  the  Graduate  School  and 
was  accepted  as  partial  fulfillment  of  the  requirements  for  the  degree  of  Doctor  of 
Philosophy. 


May  2004 


Dean,  Graduate  School 


